Abstract. This work introduces the minimax Laplace transform method, a modification of the 
cumulant-based matrix Laplace transform method developed in [Tro11c] that yields both upper and 
lower bounds on each eigenvalue of a sum of random self-adjoint matrices. This machinery is used 
to derive eigenvalue analogs of the classical Chernoff, Bennett, and Bernstein bounds. 

Two examples demonstrate the efficacy of the minimax Laplace transform. The first concerns 
the effects of column sparsification on the spectrum of a matrix with orthonormal rows. Here, the 
behavior of the singular values can be described in terms of coherence-like quantities. The second 
example addresses the question of relative accuracy in the estimation of eigenvalues of the covariance 
matrix of a random process. Standard results on the convergence of sample covariance matrices 
provide bounds on the number of samples needed to obtain relative accuracy in the spectral norm, 
but these results only guarantee relative accuracy in the estimate of the maximum eigenvalue. The 
minimax Laplace transform argument establishes that if the lowest eigenvalues decay sufficiently 
fast, $\Omega(\varepsilon^{-2} \kappa_\ell^2\, \ell \log p)$ samples, where $\kappa_\ell = \lambda_1(C)/\lambda_\ell(C)$, are sufficient to ensure that the dominant 
$\ell$ eigenvalues of the covariance matrix of a $\mathcal{N}(0, C)$ random vector are estimated to within a factor 
of $1 \pm \varepsilon$ with high probability. 



The field of nonasymptotic random matrix theory has traditionally focused on the problem of 
bounding the extreme eigenvalues of a random matrix. In some circumstances, however, we may 
also be interested in studying the behavior of the interior eigenvalues. In this case, classical tools 
do not readily apply. Indeed, the interior eigenvalues are determined by the min-max of a random 
process, which is very challenging to control. 

This paper demonstrates that it is possible to combine the matrix Laplace transform method 
detailed in [Tro11c] with the Courant-Fischer characterization of eigenvalues to obtain nontrivial 
bounds on the interior eigenvalues of a sum of random self-adjoint matrices. This approach expands 
the scope of the matrix probability inequalities from [Tro11c] so that they provide interesting 
information about the bulk spectrum. 

As one application of our approach, we investigate estimates for the covariance matrix of a 
centered stationary random process. We show that the eigenvalues of the sample covariance matrix 
provide relative-error approximations to the eigenvalues of the covariance matrix. We focus on 
Gaussian processes, but our arguments can be extended to other distributions. The following 
theorem distills the results in section 7.

Theorem 1.1. Let $C \in \mathbb{R}^{p \times p}$ be positive semidefinite. Fix an integer $\ell \le p$ and assume the tail 
$\{\lambda_j(C)\}_{j > \ell}$ of the spectrum of $C$ decays sufficiently fast that 
Let $\{\eta_j\}_{j=1}^n \subset \mathbb{R}^p$ be i.i.d. samples drawn from a $\mathcal{N}(0, C)$ distribution. Define the sample covariance 
matrix 

$$\widehat{C}_n = \frac{1}{n} \sum_{j=1}^n \eta_j \eta_j^*.$$

Let $\kappa_\ell$ be the condition number associated with a dominant $\ell$-dimensional invariant subspace of $C$, 

$$\kappa_\ell = \frac{\lambda_1(C)}{\lambda_\ell(C)}.$$

If $n = \Omega(\varepsilon^{-2} \kappa_\ell^2\, \ell \log p)$, then with high probability 

$$|\lambda_k(\widehat{C}_n) - \lambda_k(C)| \le \varepsilon \lambda_k(C) \quad \text{for } k = 1, \ldots, \ell.$$



Thus, assuming sufficiently fast decay of the residual eigenvalues, $n = \Omega(\varepsilon^{-2} \kappa_\ell^2\, \ell \log p)$ samples 
ensure that the top $\ell$ eigenvalues of $C$ are captured to relative precision. Spectral decay of this 
sort is encountered when, e.g., the residual eigenvalues of $C$ decay like $k^{-(1+\delta)}$ for some $\delta > 0$ or 
when they arise from measurements corrupted by low-power white noise. 

We contrast Theorem 1.1 with established spectral norm error bounds for covariance estimation, 
which do not exploit spectral decay and require that $n = \Omega(\varepsilon^{-2} \kappa_\ell^2\, p)$ samples be taken to capture the 
top $\ell$ eigenvalues to relative precision (see section 7). The estimate in Theorem 1.1 can be sharpened 
using information about the spectrum of $C$ and the desired failure probability or modified to account 
for different types of spectral decay. The same tools used in the proof of the theorem can be used 
to estimate $\lambda_k(\widehat{C}_n - C)$. 

1.1. Related Work. We believe that this paper contains the first general-purpose tools for study- 
ing the full spectrum of a finite-dimensional random matrix. The literature on random matrix 
theory (RMT) contains some complementary results, but they do not seem to apply with the same 
generality. Methods from RMT fall into two rough categories: asymptotic methods and nonasymp- 
totic methods. We discuss the relevant results from each in turn. 

The modern asymptotic theory began in the 1950s when physicists observed that, on certain 
scales, the behavior of a quantum system is described by the spectrum of a random matrix [Meh04]. 
They further observed the phenomenon of universality: as the dimension increases, the spectral 
statistics become independent of the distribution of the random matrix; instead, they are determined 
by the symmetries of the distribution [Dei07]. Since these initial observations, physicists, 
statisticians, engineers, and mathematicians have found manifold applications of the asymptotic 
theory in high-dimensional statistics [Joh01, Joh07, El 08], physics [GMGW98, Meh04], wireless 
communication [TV04, ST06], and pure mathematics [RS96, BK99], to mention only a few areas. 

Asymptotic random matrix theory has developed primarily through the examination of specific 
classes of random matrices. We mention two well-studied classes. Sample covariance matrices 
take the form $n^{-1} B_n B_n^*$, where the columns of $B_n$ comprise $n$ independent observations. Wigner 
matrices are Hermitian matrices whose superdiagonal entries are independent, zero-mean, and have 
unit variance and whose diagonal entries are i.i.d., real, and have finite variance. 

The fundamental object of study in asymptotic random matrix theory is the empirical spectral 
distribution function (ESD). Given a random Hermitian matrix $A$ of order $n$, its ESD 

$$F^{A}(x) = \frac{1}{n}\, \#\{1 \le i \le n : \lambda_i(A) \le x\}$$

is a random distribution function which encodes the statistics of the spectrum of $A$. Wigner's 
theorem [Wig55], the seminal result of the asymptotic theory, establishes that if $\{A_n\}$ is a sequence 
of independent, symmetric $n \times n$ matrices with i.i.d. $\mathcal{N}(0, 1)$ entries on and above the diagonal, 
then the expected ESD of $n^{-1/2} A_n$ converges weakly in probability, as $n$ approaches infinity, to the 
semicircular law given by 

$$F^{sc}(x) = \frac{1}{2\pi} \int_{-\infty}^{x} \sqrt{4 - y^2}\; \mathbf{1}_{[-2,2]}(y)\, dy.$$

Thus, at least in the limiting sense, the spectra of these random matrices are well characterized. 
Development of the classical asymptotic theory has been driven by the natural question raised by 
Wigner's result: to what extent is the semicircular law, and more generally, the existence of a 
limiting spectral distribution (LSD) universal? 

The literature on the existence and universality of LSDs is massive; we mention only the high- 
lights. It is now known that the semicircular law is universal for Wigner matrices. Suppose that 
$\{A_n\}$ is a sequence of independent $n \times n$ Wigner matrices. Grenander established that if all the moments 
are finite, then the ESD of $n^{-1/2}A_n$ converges weakly to the semicircular law in probability 
[Gre63]. Arnold showed that, assuming a finite fourth moment, the ESD almost surely converges 
weakly to the semicircular law [Arn71]. Around the same time, Marčenko and Pastur determined 
the form of the limiting spectral distribution of sample covariance matrices [MP67]. 

More recently, Tao and Vu confirmed the long-conjectured circular law hypothesis. Let $\{C_n\}$ be 
a sequence of independent $n \times n$ matrices whose entries are i.i.d. and have unit variance. Then the 
ESD of $n^{-1/2}C_n$ converges weakly to the uniform measure on the unit disk, both in probability 
and almost surely [TV10b]. 

Although the convergence rate of the ESD has considerable practical interest, it was not until 1993 
that theoretical results became available, when Bai showed that for Wigner matrices [Bai93a] and 
sample covariance matrices [Bai93b] the expected ESDs of $n^{-1/2}A_n$ and $n^{-1}B_nB_n^*$, respectively, 
both converge pointwise at a rate of $O(n^{-1/4})$. Later, Bai and coauthors established the pointwise 
convergence in probability of the ESD of the normalized Wigner matrix $n^{-1/2}A_n$ [BMT97] and 
greatly improved the convergence rates [BMT99, BMT02, BMY03]. The strongest result to date 
is due to Bai et al., who have shown that, if the entries of the Wigner matrix possess finite sixth 
moments, then pointwise convergence in probability of the ESD of $n^{-1/2}A_n$ occurs at the rate of 
$O(n^{-1/2})$ [BHPZ11]. 

Classically, individual eigenvalues have been studied through the limiting behavior of the ex- 
tremal eigenvalues and the asymptotic joint distribution of several eigenvalues. Much is known 
about the limiting distribution of the largest eigenvalues of Wigner and covariance matrices. Geman 
showed that if the columns of $B_n$ are drawn from a sufficiently regular distribution, then the 
largest eigenvalue of the sample covariance matrix $n^{-1}B_nB_n^*$ converges almost surely to a limit 
[Gem80]. Bai, Yin, and coauthors showed that the existence of a fourth moment is both necessary 
and sufficient for the existence of such a limit [YBK88, BSY88]. They also identified necessary and 
sufficient conditions for the existence of limits for the smallest and largest eigenvalues of a normalized 
Wigner matrix $n^{-1/2}A_n$ [BY88b]. El Karoui has recently described the limiting behavior of 
the leading eigenvalues of a large class of sample covariance matrices [El 07]. 

Less is known about the rate of convergence of the eigenvalues, but some results are available. 
Write the eigenvalues of a self-adjoint matrix $A$ in nonincreasing order $\lambda_1 \ge \cdots \ge \lambda_n$. For $1 \le j \le n$, 
the classical location $\gamma_j$ of the $j$th eigenvalue of the normalized Wigner matrix $n^{-1/2}A_n$ is defined 
via the relation 

$$\int_{-\infty}^{\gamma_j} \rho_{sc}(x)\, dx = \frac{j}{n},$$

where $\rho_{sc}$ is the density associated with the semicircular law. Intuitively, the facts that 
$F^{n^{-1/2}A_n} \to F^{sc}$ and $F^{n^{-1/2}A_n}(n^{-1/2}\lambda_j) = j/n$ suggest that $n^{-1/2}\lambda_j \approx \gamma_j$. Indeed, it follows from [BY88a, BY88b] that 

$$\lambda_j = \sqrt{n}\,\gamma_j + o(\sqrt{n})$$

asymptotically almost surely. Under the assumption that the entries exhibit uniform subgaussian 
decay, Erdős, Yau, and Yin have strengthened this result by showing that, up to log factors, 
the eigenvalues of $n^{-1/2}A_n$ are within $O(n^{-2/3})$ of their classical position with high probability 
[EYY10]. More generally, Tao and Vu have established the universality of a result due to Gustavsson 
[Gus05] in the complex Gaussian Wigner case: $(\log n)^{-1/2}(\sqrt{n}\,\lambda_j - n\gamma_j)$ is asymptotically normally 
distributed [TV11]. Further, they have shown that eigenvalues in the bulk of the spectrum ($j = \Omega(n)$) 
of a Wigner matrix satisfy 

$$\mathbb{E}\,|\lambda_j - \sqrt{n}\,\gamma_j|^2 = O(n^{-c})$$

for some universal constant $c > 0$ [TV10a]. 
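
The classical locations defined above can be computed numerically. The following is a minimal sketch, assuming the semicircular density $\rho_{sc}(x) = \frac{1}{2\pi}\sqrt{4 - x^2}$ on $[-2, 2]$; the function names `semicircle_cdf` and `classical_location` are illustrative and do not come from the paper. It solves the defining relation exactly as displayed, so the computed location increases with $j$.

```python
import math


def semicircle_cdf(x: float) -> float:
    """CDF of the semicircular law with density sqrt(4 - y^2) / (2*pi) on [-2, 2]."""
    if x <= -2.0:
        return 0.0
    if x >= 2.0:
        return 1.0
    return 0.5 + x * math.sqrt(4.0 - x * x) / (4.0 * math.pi) + math.asin(x / 2.0) / math.pi


def classical_location(j: int, n: int, tol: float = 1e-12) -> float:
    """Solve semicircle_cdf(gamma) = j / n for gamma by bisection on [-2, 2]."""
    target = j / n
    lo, hi = -2.0, 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if semicircle_cdf(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    n = 10
    for j in range(1, n + 1):
        print(j, round(classical_location(j, n), 6))
```

Bisection is used only because the CDF is monotone and available in closed form; any one-dimensional root finder would serve equally well.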

In contrast to the asymptotic theory, which remains to a large extent driven by the study of 
particular classes of random matrices, the nonasymptotic theory has developed as a collection 
of techniques for addressing the behavior of a broad range of random matrices. The nonasymptotic 
theory has its roots in geometric functional analysis in the 1970s, where random matrices 
were used to investigate the local properties of Banach spaces [LM93, DS01, Ver10]. Since then, 
the nonasymptotic theory has found applications in areas including theoretical computer science 
[Ach03, Vem04, SS08], machine learning [DM05], optimization [Nem07, So09], and numerical linear 
algebra [DM10, HMT11, Mah11]. 

As is the case in the asymptotic theory, the sharpest and most comprehensive results available 
in the nonasymptotic theory concern the behavior of Gaussian matrices. The amenability of the 
Gaussian distribution makes it possible to obtain results such as Szarek's nonasymptotic analog of 
the Wigner semicircle theorem for Gaussian matrices [Sza90] and Chen and Dongarra's bounds on 
the condition number of Gaussian matrices [CD05]. The properties of less well-behaved random 
matrices can sometimes be related back to those of Gaussian matrices using probabilistic tools, such 
as symmetrization; see, e.g., the derivation of Latała's bound on the norms of zero-mean random 
matrices [Lat05]. 

More generally, bounds on extremal eigenvalues can be obtained from knowledge of the moments 
of the entries. For example, the smallest singular value of a square matrix with i.i.d. zero-mean 
subgaussian entries with unit variance is $O(n^{-1/2})$ with high probability [RV08]. Concentration of 
measure results, such as Talagrand's concentration inequality for product spaces [Tal95], have also 
contributed greatly to the nonasymptotic theory. We mention in particular the work of Achlioptas 
and McSherry on randomized sparsification of matrices [AM01, AM07], that of Meckes on the norms 
of random matrices [Mec04], and that of Alon, Krivelevich, and Vu [AKV02] on the concentration of 
the largest eigenvalues of random symmetric matrices, all of which are applications of Talagrand's 
inequality. In cases where geometric information on the distribution of the random matrices is 
available, the tools of empirical process theory, such as the generic chaining, also due to Talagrand 
[Tal05], can be used to convert this geometric information into information on the spectra. One 
natural example of such a case consists of matrices whose rows are independently drawn from a 
log-concave distribution [MP06, ALPTJ11]. 

The noncommutative Khintchine inequality (NCKI), which bounds the moments of the norm of 
a sum of fixed matrices modulated by random signs [LP86, LPP91], is a widely used tool in the 
nonasymptotic theory. Despite its power, the NCKI is unwieldy. To use it, one must reduce the 
problem to a suitable form by applying symmetrization and decoupling arguments and exploiting 
the equivalence between moments and tail bounds. It is often more convenient to apply the NCKI in 
the guise of a lemma, due to Rudelson [Rud99], that provides an analog of the law of large numbers 
for sums of rank-one matrices. This result has found many applications, including column-subset 
selection [RV07] and the fast approximate solution of least-squares problems [DMMS11]. The NCKI 
and its corollaries do not always yield sharp results because parasitic logarithmic factors arise in 
many settings. 

The current paper is ultimately based on the influential work of Ahlswede and Winter [AW02]. 
This line of research leads to explicit tail bounds for the maximum eigenvalue of a sum of random 
matrices. These probability inequalities parallel the classical scalar tail bounds due to Bernstein and 
others. Matrix probability inequalities allow us to obtain valuable information about the maximum 
eigenvalue of a random matrix with very little effort. Furthermore, they apply to a wide variety 
of random matrices. We note, however, that matrix probability inequalities can lead to parasitic 
logarithmic factors similar to those that emerge from the NCKI. 

Major contributions to the literature on matrix probability inequalities include the papers [CM08, 
Rec09, Gro11]. We emphasize two works of Oliveira [Oli09, Oli10] that go well beyond earlier 
research. The sharpest current results appear in the works of Tropp [Tro11c, Tro11b, Tro11a]. 
Recently, Hsu, Kakade, and Zhang [HKZ11] have modified Tropp's approach to establish matrix 
probability inequalities that depend on an intrinsic dimension parameter, rather than the ambient 
dimension. 

1.2. Outline. In section 2 we introduce the notation used in this paper and state a convenient 
version of the Courant-Fischer theorem. In section 3 we use the Courant-Fischer theorem to 
extend the Laplace transform technique from [Tro11c] to apply to all the eigenvalues of self-adjoint 
matrices, thereby obtaining the minimax Laplace transform. We apply this technique in sections 
4 and 5 to develop eigenvalue analogs of the classical Chernoff and Bernstein bounds. The final 
two sections illustrate, using two familiar problems, that the minimax Laplace technique gives us 
significantly more information on the spectra of random matrices than current approaches. In 
section 6 we use the Chernoff bounds to quantify the effects of column sparsification on all the 
singular values of matrices with orthogonal rows. In section 7 we consider the question of how 
fast, in relative error, the eigenvalues of empirical covariance matrices converge. 



2. Background and Notation 

We establish the notation used in the sequel and state a convenient version of the Courant-Fischer 
theorem. 

Unless otherwise stated, we work over the complex field. The $k$th column of the matrix $A$ is 
denoted by $a_k$, and the entries are denoted $a_{jk}$ or $(A)_{jk}$. We define $\mathbb{M}^n_{sa}$ to be the set of 
self-adjoint matrices with dimension $n$. The eigenvalues of a matrix $A$ in $\mathbb{M}^n_{sa}$ are arranged in weakly 
decreasing order: $\lambda_{\max}(A) = \lambda_1(A) \ge \lambda_2(A) \ge \cdots \ge \lambda_n(A) = \lambda_{\min}(A)$. Likewise, singular values 
of a rectangular matrix $B$ with rank $r$ are ordered $s_1(B) \ge s_2(B) \ge \cdots \ge s_r(B)$. The spectral norm 
of a matrix $B$ is expressed as $\|B\|$. We often compare self-adjoint matrices using the semidefinite 
ordering. In this ordering, $A$ is greater than or equal to $B$, written $A \succeq B$ or $B \preceq A$, when $A - B$ 
is positive semidefinite. 

The expectation of a random variable is denoted by $\mathbb{E}X$. We write $X \sim \mathrm{Bern}(p)$ to indicate that 
$X$ has a Bernoulli distribution with mean $p$. 

One of our central tools is the variational characterization of the eigenvalues of a self-adjoint 
matrix given by the Courant-Fischer theorem. For integers $d$ and $n$ satisfying $1 \le d \le n$, the 
complex Stiefel manifold 

$$\mathbb{V}^n_d = \{V \in \mathbb{C}^{n \times d} : V^*V = I\}$$

is the collection of orthonormal bases for the $d$-dimensional subspaces of $\mathbb{C}^n$, or, equivalently, the 
collection of all isometric embeddings of $\mathbb{C}^d$ into $\mathbb{C}^n$. Let $A$ be a self-adjoint matrix with dimension 
$n$, and let $V \in \mathbb{V}^n_d$ be an orthonormal basis for a subspace of $\mathbb{C}^n$. Then the matrix $V^*AV$ can be 
interpreted as the compression of $A$ to the space spanned by $V$. 

Proposition 2.1 (Courant-Fischer). Let $A$ be a self-adjoint matrix with dimension $n$. Then 

$$\lambda_k(A) = \min_{V \in \mathbb{V}^n_{n-k+1}} \lambda_{\max}(V^*AV) \quad \text{and} \tag{2.1}$$

$$\lambda_k(A) = \max_{V \in \mathbb{V}^n_{k}} \lambda_{\min}(V^*AV). \tag{2.2}$$

A matrix $V_- \in \mathbb{V}^n_k$ achieves equality in (2.2) if and only if its columns span a dominant 
$k$-dimensional invariant subspace of $A$. Likewise, a matrix $V_+ \in \mathbb{V}^n_{n-k+1}$ achieves equality in (2.1) if 
and only if its columns span a bottom $(n-k+1)$-dimensional invariant subspace of $A$. 

The $\pm$ subscripts in Proposition 2.1 are chosen to reflect the fact that $\lambda_k(A)$ is the minimum 
eigenvalue of $V_-^*AV_-$ and the maximum eigenvalue of $V_+^*AV_+$. As a consequence of Proposition 
2.1, when $A$ is self-adjoint, $\lambda_k(-A) = -\lambda_{n-k+1}(A)$. This fact allows us to use the same techniques 
we develop for bounding the eigenvalues from above to bound them from below. 
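
As a small sanity check of Proposition 2.1, the sketch below treats a real symmetric $2 \times 2$ matrix, where the compressions to one-dimensional subspaces are Rayleigh quotients. It assumes that restricting to real unit vectors $v = (\cos\varphi, \sin\varphi)$ suffices for a real symmetric matrix; the helper names `eigenvalues_2x2` and `rayleigh_extremes` and the sample matrix are hypothetical, chosen only for illustration.

```python
import math


def eigenvalues_2x2(a: float, b: float, c: float) -> tuple:
    """Eigenvalues (lambda_1 >= lambda_2) of the real symmetric matrix [[a, b], [b, c]]."""
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    return mean + radius, mean - radius


def rayleigh_extremes(a: float, b: float, c: float, steps: int = 20_000) -> tuple:
    """Max and min of v^T A v over unit vectors v = (cos phi, sin phi), by a grid scan."""
    best_max, best_min = -math.inf, math.inf
    for i in range(steps):
        phi = math.pi * i / steps
        cs, sn = math.cos(phi), math.sin(phi)
        q = a * cs * cs + 2.0 * b * cs * sn + c * sn * sn  # compression of A to span{v}
        best_max = max(best_max, q)
        best_min = min(best_min, q)
    return best_max, best_min


if __name__ == "__main__":
    a, b, c = 2.0, 1.0, -1.0  # illustrative matrix, not taken from the paper
    lam1, lam2 = eigenvalues_2x2(a, b, c)
    q_max, q_min = rayleigh_extremes(a, b, c)
    print("lambda_1:", lam1, "max of v^T A v:", q_max)   # (2.2) with k = 1
    print("lambda_2:", lam2, "min of v^T A v:", q_min)   # (2.1) with k = 2
    print("eigenvalues of -A:", eigenvalues_2x2(-a, -b, -c))
```

With $n = 2$, the scan reproduces (2.2) for $k = 1$ and (2.1) for $k = 2$ up to grid error, and the last line illustrates the identity $\lambda_k(-A) = -\lambda_{n-k+1}(A)$.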

3. Tail Bounds For Interior Eigenvalues 

In this section we develop a generic bound on the tail probabilities of eigenvalues of sums of 
independent, random, self-adjoint matrices. We establish this bound by supplementing the matrix 
Laplace transform methodology of [Tro11c] with Proposition 2.1 and a new result, due to Lieb 
and Seiringer [LS05], on the concavity of a certain trace function on the cone of positive-definite 
matrices. 

First we observe that the Courant-Fischer theorem allows us to relate the behavior of the $k$th 
eigenvalue of a matrix to the behavior of the largest eigenvalue of an appropriate compression of 
the matrix. 

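Theorem 3.1. Let $X$ be a random self-adjoint matrix with dimension $n$, and let $k \le n$ be an 
integer. Then, for all $t \in \mathbb{R}$, 

$$\mathbb{P}\{\lambda_k(X) \ge t\} \le \inf_{\theta > 0}\; \min_{V \in \mathbb{V}^n_{n-k+1}} \left\{ e^{-\theta t} \cdot \mathbb{E} \operatorname{tr} e^{\theta V^*XV} \right\}. \tag{3.1}$$

Proof. Let $\theta$ be a fixed positive number. Then 

$$\mathbb{P}\{\lambda_k(X) \ge t\} = \mathbb{P}\{\lambda_k(\theta X) \ge \theta t\} = \mathbb{P}\{e^{\lambda_k(\theta X)} \ge e^{\theta t}\} \le e^{-\theta t} \cdot \mathbb{E}\, e^{\lambda_k(\theta X)} = e^{-\theta t} \cdot \mathbb{E} \exp\Big\{ \min_{V \in \mathbb{V}^n_{n-k+1}} \lambda_{\max}(\theta V^*XV) \Big\}.$$

The first identity follows from the positive homogeneity of eigenvalue maps and the second from 
the monotonicity of the scalar exponential function. The final two relations are Markov's inequality 
and (2.1). 

To continue, we need to bound the expectation. Interchange the order of the exponential and 
the minimum; then apply the spectral mapping theorem to see that 

$$\mathbb{E} \exp\Big\{ \min_{V \in \mathbb{V}^n_{n-k+1}} \lambda_{\max}(\theta V^*XV) \Big\} = \mathbb{E} \min_{V \in \mathbb{V}^n_{n-k+1}} \lambda_{\max}(\exp(\theta V^*XV))$$

$$\le \min_{V \in \mathbb{V}^n_{n-k+1}} \mathbb{E}\, \lambda_{\max}(\exp(\theta V^*XV))$$

$$\le \min_{V \in \mathbb{V}^n_{n-k+1}} \mathbb{E} \operatorname{tr} \exp(\theta V^*XV).$$

The first inequality is Jensen's. The second inequality follows because the exponential of a 
self-adjoint matrix is positive definite, so its largest eigenvalue is smaller than its trace. 

Combine these observations and take the infimum over all positive $\theta$ to complete the argument. 

□ 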
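
To see the bound of Theorem 3.1 in action, the following sketch evaluates both sides for a toy random matrix. Everything specific here is an assumption made for illustration: the matrix takes two hypothetical $2 \times 2$ values with equal probability, we take $n = k = 2$ so that the compressions are scalars $v^T X v$, and the infimum and minimum are replaced by a finite grid over $\theta$ and over real unit vectors. Since every fixed pair $(\theta, V)$ already gives a valid upper bound, the grid value is still an upper bound on the tail probability.

```python
import math

# Two equally likely values of a random 2x2 real symmetric matrix X (hypothetical example).
# Each matrix [[a, b], [b, c]] is stored as the triple (a, b, c).
OUTCOMES = [(2.0, 0.0, 1.0), (0.0, 0.0, -1.0)]


def smallest_eigenvalue(a: float, b: float, c: float) -> float:
    """lambda_2 of the symmetric matrix [[a, b], [b, c]]."""
    return 0.5 * (a + c) - math.hypot(0.5 * (a - c), b)


def exact_tail(t: float) -> float:
    """P{lambda_2(X) >= t} for the uniform distribution on OUTCOMES."""
    return sum(smallest_eigenvalue(*m) >= t for m in OUTCOMES) / len(OUTCOMES)


def minimax_laplace_bound(t: float, theta_grid: int = 400, phi_grid: int = 180,
                          theta_max: float = 8.0) -> float:
    """Grid version of inf_theta min_v exp(-theta t) E exp(theta v^T X v) for n = k = 2."""
    best = math.inf
    for p in range(phi_grid):
        phi = math.pi * p / phi_grid
        cs, sn = math.cos(phi), math.sin(phi)
        # Compressions v^T X v are scalars, so the trace exponential is a scalar exponential.
        quad = [a * cs * cs + 2.0 * b * cs * sn + c * sn * sn for a, b, c in OUTCOMES]
        for i in range(1, theta_grid + 1):
            theta = theta_max * i / theta_grid
            mgf = sum(math.exp(theta * q) for q in quad) / len(quad)
            best = min(best, math.exp(-theta * t) * mgf)
    return best


if __name__ == "__main__":
    t = 0.5
    print("exact tail probability:", exact_tail(t))
    print("minimax Laplace bound :", minimax_laplace_bound(t))
```

The bound is not tight for this two-point distribution, which is expected of a Markov-type argument; the point is only that the right-hand side of (3.1) can be evaluated from compressed moment generating functions.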



TAIL BOUNDS FOR EIGENVALUES OF RANDOM MATRICES 



We are interested in the case where the matrix X in Theorem 3.1 can be expressed as a sum of 
independent random matrices. In this case, we use the following result to develop the right-hand 
side of the Laplace transform bound (3.1). 

Theorem 3.2. Consider a finite sequence $\{X_j\}$ of independent, random, self-adjoint matrices with 
dimension $n$ and a sequence $\{A_j\}$ of fixed self-adjoint matrices with dimension $n$ that satisfy the 
relations 

$$\mathbb{E}\, e^{X_j} \preceq e^{A_j}. \tag{3.2}$$

Let $V \in \mathbb{V}^n_k$ be an isometric embedding of $\mathbb{C}^k$ into $\mathbb{C}^n$ for some $k \le n$. Then 

$$\mathbb{E} \operatorname{tr} \exp\Big\{ \sum\nolimits_j V^*X_jV \Big\} \le \operatorname{tr} \exp\Big\{ \sum\nolimits_j V^*A_jV \Big\}. \tag{3.3}$$

In particular, 

$$\mathbb{E} \operatorname{tr} \exp\Big\{ \sum\nolimits_j X_j \Big\} \le \operatorname{tr} \exp\Big\{ \sum\nolimits_j A_j \Big\}. \tag{3.4}$$

Theorem 3.2 is an extension of Lemma 3.4 of [Tro11c], which establishes the special case (3.4). 
The proof depends upon a recent result due to Lieb and Seiringer [LS05, Thm. 3] that extends 
Lieb's earlier result [Lie73, Thm. 6]. 

Proposition 3.1 (Lieb-Seiringer 2005). Let $H$ be a self-adjoint matrix with dimension $k$. Let 
$V \in \mathbb{V}^n_k$ be an isometric embedding of $\mathbb{C}^k$ into $\mathbb{C}^n$ for some $k \le n$. Then the function 

$$A \mapsto \operatorname{tr} \exp\{H + V^*(\log A)V\}$$

is concave on the cone of positive-definite matrices in $\mathbb{M}^n_{sa}$. 



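Proof of Theorem 3.2. First, note that (3.2) and the operator monotonicity of the matrix logarithm 
yield the following inequality for each $k$: 

$$\log \mathbb{E}\, e^{X_k} \preceq A_k. \tag{3.5}$$

Let $\mathbb{E}_k$ denote expectation conditioned on the first $k$ summands, $X_1$ through $X_k$. Then 

$$\mathbb{E} \operatorname{tr} \exp\Big\{ \sum\nolimits_{j \le \ell} V^*X_jV \Big\} = \mathbb{E}\,\mathbb{E}_1 \cdots \mathbb{E}_{\ell-1} \operatorname{tr} \exp\Big\{ \sum\nolimits_{j \le \ell-1} V^*X_jV + V^*(\log e^{X_\ell})V \Big\}$$

$$\le \mathbb{E}\,\mathbb{E}_1 \cdots \mathbb{E}_{\ell-2} \operatorname{tr} \exp\Big\{ \sum\nolimits_{j \le \ell-1} V^*X_jV + V^*(\log \mathbb{E}\, e^{X_\ell})V \Big\}$$

$$\le \mathbb{E}\,\mathbb{E}_1 \cdots \mathbb{E}_{\ell-2} \operatorname{tr} \exp\Big\{ \sum\nolimits_{j \le \ell-1} V^*X_jV + V^*(\log e^{A_\ell})V \Big\}$$

$$= \mathbb{E}\,\mathbb{E}_1 \cdots \mathbb{E}_{\ell-2} \operatorname{tr} \exp\Big\{ \sum\nolimits_{j \le \ell-1} V^*X_jV + V^*A_\ell V \Big\}.$$

The first inequality follows from Proposition 3.1 and Jensen's inequality, and the second depends 
on (3.5) and the monotonicity of the trace exponential. Iterate this argument to complete the 
proof. □ 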



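Our main result follows from combining Theorem 3.1 and Theorem 3.2. 

Theorem 3.3 (Minimax Laplace Transform). Consider a finite sequence $\{X_j\}$ of independent, 
random, self-adjoint matrices with dimension $n$, and let $k \le n$ be an integer. 

(i) Let $\{A_j\}$ be a sequence of self-adjoint matrices that satisfy the semidefinite relations 

$$\mathbb{E}\, e^{\theta X_j} \preceq e^{g(\theta) A_j},$$

where $g : (0, \infty) \to [0, \infty)$. Then, for all $t \in \mathbb{R}$, 

$$\mathbb{P}\Big\{ \lambda_k\Big( \sum\nolimits_j X_j \Big) \ge t \Big\} \le \inf_{\theta > 0}\; \min_{V \in \mathbb{V}^n_{n-k+1}} \Big[ e^{-\theta t} \cdot \operatorname{tr} \exp\Big\{ g(\theta) \sum\nolimits_j V^*A_jV \Big\} \Big].$$

(ii) Let $\{A_j : \mathbb{V}^n_{n-k+1} \to \mathbb{M}^{n-k+1}_{sa}\}$ be a sequence of functions that satisfy the semidefinite relations 

$$\mathbb{E}\, e^{\theta V^*X_jV} \preceq e^{g(\theta) A_j(V)}$$

for all $V \in \mathbb{V}^n_{n-k+1}$, where $g : (0, \infty) \to [0, \infty)$. Then, for all $t \in \mathbb{R}$, 

$$\mathbb{P}\Big\{ \lambda_k\Big( \sum\nolimits_j X_j \Big) \ge t \Big\} \le \inf_{\theta > 0}\; \min_{V \in \mathbb{V}^n_{n-k+1}} \Big[ e^{-\theta t} \cdot \operatorname{tr} \exp\Big\{ g(\theta) \sum\nolimits_j A_j(V) \Big\} \Big].$$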



The first bound in Theorem 3.3 requires less detailed information on how compression affects 
the summands but correspondingly does not give as sharp results as the second. 

In the following two sections, we use the minimax Laplace transform method to derive Chernoff 
and Bernstein inequalities for the interior eigenvalues of a sum of independent random matrices. 
Tail bounds for the eigenvalues of matrix Rademacher and Gaussian series, eigenvalue Hoeffding, 
and matrix martingale eigenvalue tail bounds can all be derived in a similar manner; see [Tro11c] 
for relevant details. 



4. Chernoff bounds 



Classical Chernoff bounds establish that the tails of a sum of independent nonnegative random 
variables decay subexponentially. [Tro11c] develops Chernoff bounds for the maximum and minimum 
eigenvalues of a sum of independent positive-semidefinite matrices. We extend this analysis 
to study the interior eigenvalues. 

Intuitively, the eigenvalue tail bounds should depend on how concentrated the summands are; 
e.g., the maximum eigenvalue of a sum of operators whose ranges are aligned is likely to vary 
more than that of a sum of operators whose ranges are orthogonal. To measure how much a 
finite sequence of random summands $\{X_j\}$ concentrates in a given subspace, we define a function 
$\Psi : \bigcup_{1 \le k \le n} \mathbb{V}^n_k \to \mathbb{R}$ that satisfies 

$$\max_j \lambda_{\max}(V^*X_jV) \le \Psi(V) \quad \text{almost surely for each } V \in \bigcup\nolimits_{1 \le k \le n} \mathbb{V}^n_k. \tag{4.1}$$

The sequence $\{X_j\}$ associated with $\Psi$ will always be clear from context. We have the following 
result. 

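Theorem 4.1 (Eigenvalue Chernoff Bounds). Consider a finite sequence $\{X_j\}$ of independent, 
random, positive-semidefinite matrices with dimension $n$. Given an integer $k \le n$, define 

$$\mu_k = \lambda_k\Big( \sum\nolimits_j \mathbb{E} X_j \Big),$$

and let $V_+ \in \mathbb{V}^n_{n-k+1}$ and $V_- \in \mathbb{V}^n_k$ be isometric embeddings that satisfy 

$$\mu_k = \lambda_{\max}\Big( \sum\nolimits_j V_+^*(\mathbb{E} X_j)V_+ \Big) = \lambda_{\min}\Big( \sum\nolimits_j V_-^*(\mathbb{E} X_j)V_- \Big).$$

Then 

$$\mathbb{P}\Big\{ \lambda_k\Big( \sum\nolimits_j X_j \Big) \ge (1+\delta)\mu_k \Big\} \le (n-k+1) \cdot \Big[ \frac{e^{\delta}}{(1+\delta)^{1+\delta}} \Big]^{\mu_k/\Psi(V_+)} \quad \text{for } \delta > 0, \text{ and}$$

$$\mathbb{P}\Big\{ \lambda_k\Big( \sum\nolimits_j X_j \Big) \le (1-\delta)\mu_k \Big\} \le k \cdot \Big[ \frac{e^{-\delta}}{(1-\delta)^{1-\delta}} \Big]^{\mu_k/\Psi(V_-)} \quad \text{for } \delta \in [0, 1),$$

where $\Psi$ is a function that satisfies (4.1). 

Theorem 4.1 tells us how the tails of the $k$th eigenvalue are controlled by the variation of the 
random summands in the top and bottom invariant subspaces of $\sum_j \mathbb{E} X_j$. Up to the dimensional 
factors $k$ and $n-k+1$, the eigenvalues exhibit binomial-type tails. When $k = 1$ (respectively, $k = n$) 
Theorem 4.1 controls the probability that the largest eigenvalue of the sum is small (respectively, the 
probability that the smallest eigenvalue of the sum is large), thereby complementing the one-sided 
Chernoff bounds of [Tro11c]. 

Remark 4.1. If it is difficult to estimate $\Psi(V_+)$ or $\Psi(V_-)$, one can resort to the weaker estimates 

$$\Psi(V_+) \le \max_{V \in \mathbb{V}^n_{n-k+1}} \max_j \|V^*X_jV\| = \max_j \|X_j\|,$$

$$\Psi(V_-) \le \max_{V \in \mathbb{V}^n_{k}} \max_j \|V^*X_jV\| = \max_j \|X_j\|.$$

Theorem 4.1 follows from Theorem 3.3 using an appropriate bound on the matrix moment 
generating functions. The following lemma is due to Ahlswede and Winter [AW02]; see also [Tro11c, 
Lem. 5.8]. 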

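Lemma 4.2. Suppose that $X$ is a random positive-semidefinite matrix that satisfies $\lambda_{\max}(X) \le 1$. 
Then 

$$\mathbb{E}\, e^{\theta X} \preceq \exp\big( (e^{\theta} - 1)(\mathbb{E} X) \big) \quad \text{for } \theta \in \mathbb{R}.$$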



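Proof of Theorem 4.1, upper bound. We consider the case where $\Psi(V_+) = 1$; the general case follows 
by homogeneity. Define 

$$A_j(V_+) = V_+^*(\mathbb{E} X_j)V_+ \quad \text{and} \quad g(\theta) = e^{\theta} - 1.$$

Theorem 3.3(ii) and Lemma 4.2 imply that 

$$\mathbb{P}\Big\{ \lambda_k\Big( \sum\nolimits_j X_j \Big) \ge (1+\delta)\mu_k \Big\} \le \inf_{\theta > 0} e^{-\theta(1+\delta)\mu_k} \cdot \operatorname{tr} \exp\Big\{ g(\theta) \sum\nolimits_j V_+^*(\mathbb{E} X_j)V_+ \Big\}.$$

Bound the trace by the maximum eigenvalue, taking into account the reduced dimension of the 
summands: 

$$\operatorname{tr} \exp\Big\{ g(\theta) \sum\nolimits_j V_+^*(\mathbb{E} X_j)V_+ \Big\} \le (n-k+1) \cdot \lambda_{\max}\Big( \exp\Big\{ g(\theta) \sum\nolimits_j V_+^*(\mathbb{E} X_j)V_+ \Big\} \Big)$$

$$= (n-k+1) \cdot \exp\Big\{ g(\theta) \cdot \lambda_{\max}\Big( \sum\nolimits_j V_+^*(\mathbb{E} X_j)V_+ \Big) \Big\}.$$

The equality follows from the spectral mapping theorem. Identify the quantity $\mu_k$; then combine 
the last two inequalities to obtain 

$$\mathbb{P}\Big\{ \lambda_k\Big( \sum\nolimits_j X_j \Big) \ge (1+\delta)\mu_k \Big\} \le (n-k+1) \cdot \inf_{\theta > 0} e^{[g(\theta) - \theta(1+\delta)]\mu_k}.$$

The right-hand side is minimized when $\theta = \log(1+\delta)$, which gives the desired upper tail bound. □ 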
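
The final optimization step in the proof of the upper bound can be checked numerically. The sketch below works in the normalized case $\Psi(V_+) = 1$ and compares a brute-force minimum of $e^{[g(\theta) - \theta(1+\delta)]\mu_k}$ with the closed form $[e^{\delta}/(1+\delta)^{1+\delta}]^{\mu_k}$ obtained at $\theta = \log(1+\delta)$. The function names, grid sizes, and sample values of $\delta$ and $\mu_k$ are illustrative assumptions.

```python
import math


def chernoff_exponent(theta: float, delta: float, mu: float) -> float:
    """Exponent [g(theta) - theta (1 + delta)] * mu with g(theta) = exp(theta) - 1."""
    return (math.exp(theta) - 1.0 - theta * (1.0 + delta)) * mu


def chernoff_upper_factor(delta: float, mu: float) -> float:
    """Closed form [e^delta / (1 + delta)^(1 + delta)]^mu, attained at theta = log(1 + delta)."""
    return math.exp(mu * (delta - (1.0 + delta) * math.log1p(delta)))


def grid_minimum(delta: float, mu: float, theta_max: float = 5.0, steps: int = 50_000) -> float:
    """Brute-force minimum of exp(chernoff_exponent) over a grid of positive theta."""
    return min(
        math.exp(chernoff_exponent(theta_max * i / steps, delta, mu))
        for i in range(1, steps + 1)
    )


if __name__ == "__main__":
    delta, mu = 0.5, 3.0  # illustrative values
    print("closed form :", chernoff_upper_factor(delta, mu))
    print("grid minimum:", grid_minimum(delta, mu))
```

Multiplying the closed-form factor by the dimensional factor $n - k + 1$ gives the upper tail bound in the normalized case.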
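Proof of Theorem 4.1, lower bound. As before, we consider the case where $\Psi(V_-) = 1$. Clearly, 

$$\mathbb{P}\Big\{ \lambda_k\Big( \sum\nolimits_j X_j \Big) \le (1-\delta)\mu_k \Big\} = \mathbb{P}\Big\{ \lambda_{n-k+1}\Big( \sum\nolimits_j -X_j \Big) \ge -(1-\delta)\mu_k \Big\}. \tag{4.2}$$

Apply Lemma 4.2 to see that, for $\theta > 0$, 

$$\mathbb{E}\, e^{\theta(-V_-^*X_jV_-)} = \mathbb{E}\, e^{(-\theta)V_-^*X_jV_-} \preceq \exp\big( g(\theta) \cdot V_-^*(-\mathbb{E} X_j)V_- \big),$$

where $g(\theta) = 1 - e^{-\theta}$. Theorem 3.3(ii) thus implies that the latter probability in (4.2) is bounded 
by 

$$\inf_{\theta > 0} e^{\theta(1-\delta)\mu_k} \cdot \operatorname{tr} \exp\Big\{ g(\theta) \sum\nolimits_j V_-^*(-\mathbb{E} X_j)V_- \Big\}.$$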



10 A. GITTENS AND J. A. TROPP 

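Using reasoning analogous to that in the proof of the upper bound, we justify the first of the 
following relations: 

$$\operatorname{tr} \exp\Big\{ g(\theta) \sum\nolimits_j V_-^*(-\mathbb{E} X_j)V_- \Big\} \le k \cdot \exp\Big\{ \lambda_{\max}\Big( g(\theta) \sum\nolimits_j V_-^*(-\mathbb{E} X_j)V_- \Big) \Big\}$$

$$= k \cdot \exp\Big\{ -g(\theta) \cdot \lambda_{\min}\Big( \sum\nolimits_j V_-^*(\mathbb{E} X_j)V_- \Big) \Big\}$$

$$= k \cdot \exp\{ -g(\theta)\mu_k \}.$$

The remaining equalities follow from the fact that $-g(\theta) < 0$ and the definition of $\mu_k$. 
This argument establishes the bound 

$$\mathbb{P}\Big\{ \lambda_k\Big( \sum\nolimits_j X_j \Big) \le (1-\delta)\mu_k \Big\} \le k \cdot \inf_{\theta > 0} e^{[\theta(1-\delta) - g(\theta)]\mu_k}.$$

The right-hand side is minimized when $\theta = -\log(1-\delta)$, which gives the desired lower tail bound. □ 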

5. Bennett and Bernstein inequalities 

The classical Bennett and Bernstein inequalities use the variance or knowledge of the moments of 
the summands to control the probability that a sum of independent random variables deviates from 
its mean. In [Tro11c], matrix Bennett and Bernstein inequalities are developed for the extreme 
eigenvalues of self-adjoint random matrix sums. We establish that the interior eigenvalues satisfy 
analogous inequalities. 

As in the derivation of the Chernoff inequalities of section 4, we need a measure of how concentrated 
the random summands are in a given subspace. Recall that the function $\Psi : \bigcup_{1 \le k \le n} \mathbb{V}^n_k \to \mathbb{R}$ 
satisfies 

$$\max_j \lambda_{\max}(V^*X_jV) \le \Psi(V) \quad \text{almost surely for each } V \in \bigcup\nolimits_{1 \le k \le n} \mathbb{V}^n_k. \tag{5.1}$$

The sequence $\{X_j\}$ associated with $\Psi$ will always be clear from context. 

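Theorem 5.1 (Eigenvalue Bennett Inequality). Consider a finite sequence $\{X_j\}$ of independent, 
random, self-adjoint matrices with dimension $n$, all of which have zero mean. Given an integer 
$k \le n$, define 

$$\sigma_k^2 = \lambda_k\Big( \sum\nolimits_j \mathbb{E}(X_j^2) \Big).$$

Choose $V_+ \in \mathbb{V}^n_{n-k+1}$ to satisfy 

$$\sigma_k^2 = \lambda_{\max}\Big( \sum\nolimits_j V_+^*\,\mathbb{E}(X_j^2)\,V_+ \Big).$$

Then, for all $t \ge 0$, 

$$\mathbb{P}\Big\{ \lambda_k\Big( \sum\nolimits_j X_j \Big) \ge t \Big\} \le (n-k+1) \cdot \exp\Big\{ -\frac{\sigma_k^2}{\Psi(V_+)^2} \cdot h\Big( \frac{\Psi(V_+)\,t}{\sigma_k^2} \Big) \Big\} \tag{i}$$

$$\le (n-k+1) \cdot \exp\Big\{ \frac{-t^2/2}{\sigma_k^2 + \Psi(V_+)\,t/3} \Big\} \tag{ii}$$

$$\le \begin{cases} (n-k+1) \cdot \exp\{ -\tfrac{3}{8}\, t^2/\sigma_k^2 \} & \text{for } t \le \sigma_k^2/\Psi(V_+), \\ (n-k+1) \cdot \exp\{ -\tfrac{3}{8}\, t/\Psi(V_+) \} & \text{for } t \ge \sigma_k^2/\Psi(V_+), \end{cases} \tag{iii}$$

where the function $h(u) = (1+u)\log(1+u) - u$ for $u \ge 0$. The function $\Psi$ satisfies (5.1) above. 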

Results (i) and (ii) are, respectively, matrix analogs of the classical Bennett and Bernstein 
inequalities. As in the scalar case, the Bennett inequality reflects a Poisson-type decay in the tails 
of the eigenvalues. The Bernstein inequality states that small deviations from the eigenvalues of 
the expected matrix are roughly normally distributed while larger deviations are subexponential. 
The split Bernstein inequalities (iii) make explicit the division between these two regimes. 

As stated, Theorem 5.1 estimates the probability that the eigenvalues of a sum are large. Using 
the identity 

$$\lambda_k\Big( -\sum\nolimits_j X_j \Big) = -\lambda_{n-k+1}\Big( \sum\nolimits_j X_j \Big),$$

Theorem 5.1 can be applied to estimate the probability that eigenvalues of a sum are small. 

To prove Theorem 5.1, we use the following lemma (Lemma 6.7 in [Tro11c]) to control the 
moment generating function of a random matrix with bounded maximum eigenvalue. 

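Lemma 5.2. Let $X$ be a random self-adjoint matrix satisfying $\mathbb{E} X = 0$ and $\lambda_{\max}(X) \le 1$ almost 
surely. Then 

$$\mathbb{E}\, e^{\theta X} \preceq \exp\big( (e^{\theta} - \theta - 1) \cdot \mathbb{E}(X^2) \big) \quad \text{for } \theta > 0.$$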



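Proof of Theorem 5.1. Using homogeneity, we assume without loss that $\Psi(V_+) = 1$. This implies 
that $\lambda_{\max}(X_j) \le 1$ almost surely for all the summands. By Lemma 5.2, 

$$\mathbb{E}\, e^{\theta X_j} \preceq \exp\big( g(\theta) \cdot \mathbb{E}(X_j^2) \big),$$

with $g(\theta) = e^{\theta} - \theta - 1$. 

Theorem 3.3(i) then implies 

$$\mathbb{P}\Big\{ \lambda_k\Big( \sum\nolimits_j X_j \Big) \ge t \Big\} \le \inf_{\theta > 0} e^{-\theta t} \cdot \operatorname{tr} \exp\Big\{ g(\theta) \sum\nolimits_j V_+^*\,\mathbb{E}(X_j^2)\,V_+ \Big\}$$

$$\le (n-k+1) \cdot \inf_{\theta > 0} e^{-\theta t} \cdot \lambda_{\max}\Big( \exp\Big\{ g(\theta) \sum\nolimits_j V_+^*\,\mathbb{E}(X_j^2)\,V_+ \Big\} \Big)$$

$$= (n-k+1) \cdot \inf_{\theta > 0} e^{-\theta t} \cdot \exp\Big\{ g(\theta) \cdot \lambda_{\max}\Big( \sum\nolimits_j V_+^*\,\mathbb{E}(X_j^2)\,V_+ \Big) \Big\}.$$

The maximum eigenvalue in this expression equals $\sigma_k^2$, thus 

$$\mathbb{P}\Big\{ \lambda_k\Big( \sum\nolimits_j X_j \Big) \ge t \Big\} \le (n-k+1) \cdot \inf_{\theta > 0} e^{g(\theta)\sigma_k^2 - \theta t}.$$

The Bennett inequality (i) follows by substituting $\theta = \log(1 + t/\sigma_k^2)$ into the right-hand side and 
simplifying. 

The Bernstein inequality (ii) is a consequence of (i) and the fact that 

$$h(u) \ge \frac{u^2/2}{1 + u/3} \quad \text{for } u \ge 0,$$

which can be established by comparing derivatives. 

The subgaussian and subexponential portions of the split Bernstein inequalities (iii) are verified 
through algebraic comparisons on the relevant intervals. □ 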
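
The chain Bennett, Bernstein, split Bernstein in Theorem 5.1 amounts to an ordering of three exponents, which is easy to probe numerically. The sketch below is an illustration only: the function names and the sample values of $\sigma_k^2$ and $\Psi(V_+)$ (written `sigma2` and `psi`) are assumptions, and the constant $3/8$ is the one appearing in the split bounds.

```python
import math


def h(u: float) -> float:
    """Bennett function h(u) = (1 + u) log(1 + u) - u for u >= 0."""
    return (1.0 + u) * math.log1p(u) - u


def bennett_exponent(t: float, sigma2: float, psi: float = 1.0) -> float:
    """Exponent in (i): (sigma^2 / psi^2) * h(psi * t / sigma^2)."""
    return (sigma2 / psi ** 2) * h(psi * t / sigma2)


def bernstein_exponent(t: float, sigma2: float, psi: float = 1.0) -> float:
    """Exponent in (ii): (t^2 / 2) / (sigma^2 + psi * t / 3)."""
    return (t * t / 2.0) / (sigma2 + psi * t / 3.0)


def split_exponent(t: float, sigma2: float, psi: float = 1.0) -> float:
    """Exponent in (iii): subgaussian for small t, subexponential for large t."""
    if t <= sigma2 / psi:
        return 0.375 * t * t / sigma2
    return 0.375 * t / psi


def tail_bound(exponent: float, n: int, k: int) -> float:
    """Tail bound (n - k + 1) * exp(-exponent) for the k-th eigenvalue."""
    return (n - k + 1) * math.exp(-exponent)


if __name__ == "__main__":
    sigma2, psi, n, k = 2.0, 0.5, 100, 10  # illustrative values
    for t in (0.5, 2.0, 8.0, 32.0):
        exps = (bennett_exponent(t, sigma2, psi),
                bernstein_exponent(t, sigma2, psi),
                split_exponent(t, sigma2, psi))
        print(t, [tail_bound(e, n, k) for e in exps])
```

A larger exponent means a smaller tail bound, so the printed bounds should be nondecreasing from left to right on each row; bounds above one are vacuous.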

Occasionally, as in the application in section 7 to the problem of covariance matrix estimation, 
one desires a Bernstein-type tail bound that applies to summands that do not have bounded 
maximum eigenvalues. In this case, if the moments of the summands satisfy sufficiently strong 
growth restrictions, one can extend classical scalar arguments to obtain results such as the following 
Bernstein bound for subexponential matrices. 

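Theorem 5.3 (Eigenvalue Bernstein Inequality for Subexponential Matrices). Consider a finite
sequence $\{X_j\}$ of independent, random, self-adjoint matrices with dimension $n$, all of which satisfy
the subexponential moment growth condition

$$\mathbb{E}(X_j^m) \preccurlyeq \frac{m!}{2} B^{m-2} \Sigma_j^2 \quad \text{for } m = 2, 3, 4, \ldots,$$

where $B$ is a positive constant and $\Sigma_j^2$ are positive-semidefinite matrices. Given an integer $k \le n$,
set

$$\mu_k = \lambda_k\Big(\sum\nolimits_j \mathbb{E} X_j\Big).$$

Choose $V_+ \in \mathbb{V}^n_{n-k+1}$ that satisfies

$$\mu_k = \lambda_{\max}\Big(\sum\nolimits_j V_+^* (\mathbb{E} X_j) V_+\Big),$$

and define

$$\sigma_k^2 = \lambda_{\max}\Big(\sum\nolimits_j V_+^* \Sigma_j^2 V_+\Big).$$

Then, for any $t \ge 0$,

$$\mathbb{P}\Big\{\lambda_k\Big(\sum\nolimits_j X_j\Big) \ge \mu_k + t\Big\} \le (n-k+1) \cdot \exp\Big\{-\frac{t^2/2}{\sigma_k^2 + Bt}\Big\} \qquad \text{(i)}$$

$$\le \begin{cases}
(n-k+1) \cdot \exp\{-\tfrac{1}{4} t^2/\sigma_k^2\} & \text{for } t \le \sigma_k^2/B \\
(n-k+1) \cdot \exp\{-\tfrac{1}{4} t/B\} & \text{for } t \ge \sigma_k^2/B.
\end{cases} \qquad \text{(ii)}$$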



This result is an extension of [Tro11c, Theorem 6.2], which, in turn, generalizes a classical scalar
argument [DG98].

As with the other matrix inequalities, Theorem 5.3 follows from an application of Theorem 3.3
and appropriate semidefinite bounds on the moment generating functions of the summands. Thus,
the key to the proof lies in exploiting the moment growth conditions of the summands to majorize
their moment generating functions. The following lemma, a trivial extension of Lemma 6.8 in
[Tro11c], provides what we need.

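Lemma 5.4. Let $X$ be a random self-adjoint matrix satisfying the subexponential moment growth
conditions

$$\mathbb{E}(X^m) \preccurlyeq \frac{m!}{2} \Sigma^2 \quad \text{for } m = 2, 3, 4, \ldots.$$

Then, for any $\theta$ in $[0, 1)$,

$$\mathbb{E} \exp(\theta X) \preccurlyeq \exp\Big(\theta\, \mathbb{E} X + \frac{\theta^2}{2(1-\theta)} \Sigma^2\Big).$$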



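Proof of Theorem 5.3. We note that $X_j$ satisfies the growth condition

$$\mathbb{E}(X_j^m) \preccurlyeq \frac{m!}{2} B^{m-2} \Sigma_j^2 \quad \text{for } m \ge 2$$

if and only if the scaled matrix $X_j/B$ satisfies

$$\mathbb{E}\Big(\frac{X_j}{B}\Big)^m \preccurlyeq \frac{m!}{2} \cdot \frac{\Sigma_j^2}{B^2} \quad \text{for } m \ge 2.$$

Thus, by rescaling, it suffices to consider the case $B = 1$. We now do so.
By Lemma 5.4, the moment generating functions of the summands satisfy

$$\mathbb{E} \exp(\theta X_j) \preccurlyeq \exp\big(\theta\, \mathbb{E} X_j + g(\theta) \Sigma_j^2\big),$$

where $g(\theta) = \theta^2/(2 - 2\theta)$. Now we apply Theorem 3.3(i):

$$\begin{aligned}
\mathbb{P}\Big\{\lambda_k\Big(\sum\nolimits_j X_j\Big) \ge \mu_k + t\Big\}
&\le \inf_{\theta \in (0,1)} e^{-\theta(\mu_k + t)} \cdot \operatorname{tr} \exp\Big\{\theta \sum\nolimits_j V_+^* (\mathbb{E} X_j) V_+ + g(\theta) \sum\nolimits_j V_+^* \Sigma_j^2 V_+\Big\} \\
&\le \inf_{\theta \in (0,1)} (n-k+1) \cdot \exp\Big\{-\theta(\mu_k + t) + \theta \cdot \lambda_{\max}\Big(\sum\nolimits_j V_+^* (\mathbb{E} X_j) V_+\Big) \\
&\qquad\qquad + g(\theta) \cdot \lambda_{\max}\Big(\sum\nolimits_j V_+^* \Sigma_j^2 V_+\Big)\Big\} \\
&= \inf_{\theta \in (0,1)} (n-k+1) \cdot \exp\big\{-\theta t + g(\theta) \sigma_k^2\big\}.
\end{aligned}$$

To achieve the final simplification, we identified $\mu_k$ and $\sigma_k^2$. Now, select $\theta = t/(t + \sigma_k^2)$. Then
simplification gives the Bernstein inequality (i).

Algebraic comparisons on the relevant intervals yield the split Bernstein inequalities (ii). □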
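The algebraic comparison between the Bernstein bound and its two split branches can be illustrated numerically. The sketch below is an illustration under assumed inputs: the function names and the sample values of $n$, $k$, $\sigma_k^2$ and $B$ are hypothetical. It evaluates the bound $(n-k+1)\exp\{-(t^2/2)/(\sigma_k^2 + Bt)\}$ together with the subgaussian and subexponential branches, and checks that the split form is never smaller.

```python
import math

def bernstein_tail(t, sigma2, B, mult):
    """Bernstein bound mult * exp(-(t^2/2) / (sigma^2 + B t)); mult plays the role of n - k + 1."""
    return mult * math.exp(-(t * t / 2.0) / (sigma2 + B * t))

def split_bernstein_tail(t, sigma2, B, mult):
    """Split form: subgaussian branch for t <= sigma^2/B, subexponential branch otherwise."""
    if t <= sigma2 / B:
        return mult * math.exp(-t * t / (4.0 * sigma2))
    return mult * math.exp(-t / (4.0 * B))

# Hypothetical parameters, chosen only for illustration.
n, k = 10, 3
sigma2, B = 4.0, 0.5
mult = n - k + 1

ts = [0.5, 2.0, 8.0, 20.0, 50.0]   # sigma2 / B = 8 is the crossover point
pairs = [(bernstein_tail(t, sigma2, B, mult), split_bernstein_tail(t, sigma2, B, mult)) for t in ts]
for t, (tight, split) in zip(ts, pairs):
    print(f"t={t:5.1f}  Bernstein={tight:.3e}  split={split:.3e}")
```

At the crossover $t = \sigma_k^2/B$ the two branches of the split bound coincide, which is why the case boundaries may overlap.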



6. An Application to Column Subsampling

As an application of our Chernoff bounds, we examine how sampling columns from a matrix
with orthonormal rows affects the spectrum. This question has applications in numerical linear
algebra and compressed sensing. The special cases of the maximum and minimum eigenvalues
have been studied in the literature [Tro08, RV07]. The limiting spectral distributions of matrices
formed by sampling columns from similarly structured matrices have also been studied: the results
of [GH10] apply to matrices formed by sampling columns from any fixed orthogonal matrix, and
[Far10] studies matrices formed by sampling columns and rows from the discrete Fourier transform
matrix. We mention in particular [Rud99], the main result of which provides a uniform bound on
the tails of all singular values of the sampled matrix. The theorem proven in this section provides
bounds which reflect the differences in the tails of the individual singular values, and thus can be
viewed as an elaboration of the result in [Rud99].

Let $U$ be an $n \times r$ matrix with orthonormal rows. We model the sampling operation using a
random diagonal matrix $D$ whose entries are independent Bern($p$) random variables. Then the
random matrix

$$\hat{U} = UD \qquad (6.1)$$

can be interpreted as a random column submatrix of $U$ with an average of $pr$ nonzero columns.
Our goal is to study the behavior of the spectrum of $\hat{U}$.

Recall that the $j$th column of $U$ is written $u_j$. Consider the following coherence-like quantity
associated with $U$:

$$\tau_k = \min_{V \in \mathbb{V}^n_k} \max_j \|V^* u_j\|^2 \quad \text{for } k = 1, \ldots, n. \qquad (6.2)$$

There does not seem to be a simple expression for $\tau_k$. However, by choosing $V^*$ to be the restriction
to an appropriate $k$-dimensional coordinate subspace, we see that $\tau_k$ always satisfies

$$\tau_k \le \min_{|I| = k} \max_j \sum\nolimits_{i \in I} u_{ij}^2.$$

The following theorem shows that the behavior of $s_k(\hat{U})$, the $k$th singular value of $\hat{U}$, can be
explained in terms of $\tau_k$.
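The coordinate-subspace bound on $\tau_k$ is easy to evaluate by brute force for small matrices. The following sketch is an illustration only: the function name `coordinate_bound` and the example matrix are assumptions. It enumerates all index sets $I$ of size $k$ and returns $\min_I \max_j \sum_{i \in I} u_{ij}^2$, which is an upper bound on $\tau_k$ and not $\tau_k$ itself.

```python
from itertools import combinations

def coordinate_bound(U, k):
    """Upper bound on tau_k: min over k-subsets I of rows of max_j sum_{i in I} U[i][j]^2.

    U is a list of n rows of length r (real entries assumed). The cost grows like
    C(n, k) * r * k, so this is only meant for small examples.
    """
    n, r = len(U), len(U[0])
    best = float("inf")
    for I in combinations(range(n), k):
        worst_column = max(sum(U[i][j] ** 2 for i in I) for j in range(r))
        best = min(best, worst_column)
    return best

# Hypothetical 2 x 4 example with orthonormal rows.
U = [[0.5, 0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5, -0.5]]

bound_1 = coordinate_bound(U, 1)
bound_2 = coordinate_bound(U, 2)
print(bound_1, bound_2)
```

For $k = n$ every admissible $V$ is unitary, so the bound coincides with $\tau_n = \max_j \|u_j\|^2$; for smaller $k$ it can be strictly larger than $\tau_k$.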



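Theorem 6.1 (Column Subsampling of Matrices with Orthonormal Rows). Let $U$ be an $n \times r$
matrix with orthonormal rows, and let $p$ be a sampling probability. Define the sampled matrix $\hat{U}$
according to (6.1), and the numbers $\{\tau_k\}$ according to (6.2). Then, for each $k = 1, \ldots, n$,

$$\mathbb{P}\big\{s_k(\hat{U}) \ge \sqrt{(1+\delta)p}\big\} \le (n-k+1) \cdot \Big[\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\Big]^{p/\tau_{n-k+1}} \quad \text{for } \delta > 0,$$

$$\mathbb{P}\big\{s_k(\hat{U}) \le \sqrt{(1-\delta)p}\big\} \le k \cdot \Big[\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\Big]^{p/\tau_k} \quad \text{for } \delta \in [0, 1).$$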



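Proof. Observe, using (6.1), that

$$s_k(\hat{U})^2 = \lambda_k(\hat{U}\hat{U}^*) = \lambda_k(U D^2 U^*) = \lambda_k\Big(\sum\nolimits_j d_j u_j u_j^*\Big),$$

where $u_j$ is the $j$th column of $U$ and $d_j \sim$ Bern($p$). Compute

$$\mu_k = \lambda_k\Big(\sum\nolimits_j \mathbb{E}\, d_j u_j u_j^*\Big) = p \cdot \lambda_k(UU^*) = p \cdot \lambda_k(I) = p.$$

It follows that, for any $V \in \mathbb{V}^n_{n-k+1}$,

$$\lambda_{\max}\Big(\sum\nolimits_j V^* (\mathbb{E}\, d_j u_j u_j^*) V\Big) = p \cdot \lambda_{\max}(V^* V) = p = \mu_k,$$

so the choice of $V_+ \in \mathbb{V}^n_{n-k+1}$ is arbitrary. Similarly, the choice of $V_- \in \mathbb{V}^n_k$ is arbitrary. We select
$V_+$ to be an isometric embedding that achieves $\tau_{n-k+1}$ and $V_-$ to be an isometric embedding that
achieves $\tau_k$. Accordingly,

$$\Psi(V_+) = \max_j \|V_+^* u_j u_j^* V_+\| = \max_j \|V_+^* u_j\|^2 = \tau_{n-k+1}, \quad \text{and}$$

$$\Psi(V_-) = \max_j \|V_-^* u_j u_j^* V_-\| = \max_j \|V_-^* u_j\|^2 = \tau_k.$$

Theorem 4.1 delivers the upper bound

$$\mathbb{P}\big\{s_k(\hat{U}) \ge \sqrt{(1+\delta)p}\big\} = \mathbb{P}\Big\{\lambda_k\Big(\sum\nolimits_j d_j u_j u_j^*\Big) \ge (1+\delta)p\Big\} \le (n-k+1) \cdot \Big[\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\Big]^{p/\tau_{n-k+1}}$$

for $\delta > 0$ and the lower bound

$$\mathbb{P}\big\{s_k(\hat{U}) \le \sqrt{(1-\delta)p}\big\} = \mathbb{P}\Big\{\lambda_k\Big(\sum\nolimits_j d_j u_j u_j^*\Big) \le (1-\delta)p\Big\} \le k \cdot \Big[\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\Big]^{p/\tau_k}$$

for $\delta \in [0, 1)$. □

To illustrate the discriminatory power of these bounds, let $U$ be an $n \times n^2$ matrix consisting of $n$
rows of the $n^2 \times n^2$ Fourier matrix and choose $p = (\log n)/n$ so that, on average, sampling reduces
the aspect ratio from $n$ to $\log n$. For $n = 100$, we determine upper and lower bounds for the median
value of $s_k(\hat{U})$ by numerically finding the value of $\delta$ where the probability bounds in Theorem 6.1
equal 1/2. Figure 1 plots the empirical median value along with the computed interval. We see that
these ranges reflect the behavior of the singular values more faithfully than the simple estimates
$s_k(\mathbb{E}\hat{U}) = p$.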
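The median interval described above comes from solving for the value of $\delta$ at which each bound in Theorem 6.1 equals 1/2. The sketch below shows one way to do this by bisection. It is an illustration under stated assumptions: the helper names, the bisection bracket, and the numerical values of $p$, $\tau_{n-k+1}$, $\tau_k$, $n$ and $k$ are hypothetical, and the $\tau$ values must be supplied by the user because (6.2) has no simple closed form.

```python
import math

def log_chernoff_upper(delta, p, tau):
    """log of [e^delta / (1 + delta)^(1 + delta)]^(p / tau), for delta >= 0."""
    return (p / tau) * (delta - (1.0 + delta) * math.log1p(delta))

def log_chernoff_lower(delta, p, tau):
    """log of [e^(-delta) / (1 - delta)^(1 - delta)]^(p / tau), for delta in [0, 1)."""
    return (p / tau) * (-delta - (1.0 - delta) * math.log1p(-delta))

def solve_delta(log_bound, mult, target, lo, hi, iters=200):
    """Bisection for mult * exp(log_bound(delta)) == target.

    Both Chernoff exponents are decreasing in delta, so the bound crosses the
    target at most once. Returns None when the bound at hi still exceeds target.
    """
    def excess(delta):
        return math.log(mult) + log_bound(delta) - math.log(target)
    if excess(hi) > 0:
        return None
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
    return hi

# Hypothetical inputs, chosen only for illustration.
p, tau_plus, tau_minus = 0.05, 0.01, 0.01
n, k = 6, 2

delta_plus = solve_delta(lambda d: log_chernoff_upper(d, p, tau_plus), n - k + 1, 0.5, 0.0, 100.0)
delta_minus = solve_delta(lambda d: log_chernoff_lower(d, p, tau_minus), k, 0.5, 0.0, 1.0 - 1e-12)

median_upper = math.sqrt((1.0 + delta_plus) * p)
median_lower = math.sqrt((1.0 - delta_minus) * p)
print(median_lower, median_upper)
```

When the lower-tail bound never drops to 1/2 on $[0, 1)$, `solve_delta` returns `None`, and only the trivial lower estimate of zero is available for the median.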

7. Covariance Estimation

We conclude with an extended example that illustrates how this circle of ideas allows one to 
answer interesting statistical questions. Specifically, we investigate the convergence of the individual 
eigenvalues of sample covariance matrices, with errors measured in relative precision. 

Covariance estimation is a basic and ubiquitous problem that arises in signal processing, graphical
modeling, machine learning, and genomics, among other areas. Let $\{\eta_j\}_{j=1}^n \subset \mathbb{R}^p$ be i.i.d. samples
drawn from some distribution with zero mean and covariance matrix $C$. Define the sample
covariance matrix

$$\hat{C}_n = \frac{1}{n} \sum\nolimits_{j=1}^n \eta_j \eta_j^*.$$

Figure 1. [Spectrum of a random submatrix] The matrix $U$ is a $10^2 \times 10^4$ submatrix
of the unitary DFT matrix with dimension $10^4$, and the sampling probability is $p = (\log n)/n$
with $n = 10^2$. The $k$th vertical bar, calculated using Theorem 6.1, describes an
interval containing the median value of the $k$th singular value of the sampled matrix
$\hat{U}$. The black circles denote the empirical medians of the singular values of $\hat{U}$,
calculated from 500 trials. The gray circles represent the singular values of $\mathbb{E}\hat{U}$.



An important challenge is to determine how many samples are needed to ensure that the empirical
covariance estimator has a fixed relative accuracy in the spectral norm. That is, given a fixed $\varepsilon$,
how large must $n$ be so that

$$\|\hat{C}_n - C\| \le \varepsilon \|C\|? \qquad (7.1)$$

This estimation problem has been studied extensively. It is now known that for distributions with
a finite second moment, $O(p \log p)$ samples suffice [Rud99], and for log-concave distributions, $\Omega(p)$
samples suffice [ALPTJ11]. More broadly, Vershynin [Ver11] conjectures that, for distributions
with finite fourth moment, $\Omega(p)$ samples suffice; he establishes this result to within iterated log
factors. In [SV11], Srivastava and Vershynin establish that $\Omega(p)$ samples suffice for distributions
which have finite $2 + \varepsilon$ moments, for some $\varepsilon > 0$, and satisfy an additional regularity condition.



Inequality (7.1) ensures that the difference between the $k$th eigenvalues of $\hat{C}_n$ and $C$ is small,
but it requires $O(p)$ measurements to obtain estimates of even a few of the eigenvalues. Specifically,
letting $\kappa_\ell = \lambda_1(C)/\lambda_\ell(C)$, we see that $O(\varepsilon^{-2} \kappa_\ell^2 p)$ measurements are required to obtain relative-error
estimates of the dominant $\ell$ eigenvalues of $C$ using the results of [ALPTJ11, Ver11, SV11].
However, it is reasonable to expect that when the spectrum of $C$ exhibits decay and $\ell \ll p$,
far fewer than $O(p)$ measurements should suffice for relative-error recovery of the dominant $\ell$
eigenvalues.

In this section, we derive a relative approximation bound for each eigenvalue of $C$ that allows
us to confirm this intuition. For simplicity we assume the samples are drawn from a $\mathcal{N}(0, C)$
distribution where $C$ is full-rank, but the arguments can be extended to cover other distributions.

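Theorem 7.1. Assume that $C \in \mathbb{M}_p$ is positive definite. Let $\{\eta_j\}_{j=1}^n \subset \mathbb{R}^p$ be i.i.d. samples drawn
from a $\mathcal{N}(0, C)$ distribution. Define

$$\hat{C}_n = \frac{1}{n} \sum\nolimits_{j=1}^n \eta_j \eta_j^*.$$

Write $\lambda_k$ for the $k$th eigenvalue of $C$, and write $\hat{\lambda}_k$ for the $k$th eigenvalue of $\hat{C}_n$. Then for
$k = 1, \ldots, p$,

$$\mathbb{P}\big\{\hat{\lambda}_k \ge \lambda_k + t\big\} \le (p-k+1) \cdot \exp\Big(\frac{-c n t^2}{\lambda_k \sum_{i=k}^p \lambda_i}\Big) \quad \text{for } t \le 4n\lambda_k,$$

and

$$\mathbb{P}\big\{\hat{\lambda}_k \le \lambda_k - t\big\} \le k \cdot \exp\Big(\frac{-c n t^2}{\lambda_1 \sum_{i=1}^k \lambda_i}\Big) \quad \text{for } t \le 4n\lambda_1,$$

where the constant $c$ is at least 1/32.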

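The following corollary provides an answer to our question about relative error estimates.

Corollary 7.2. Let $\lambda_k$ and $\hat{\lambda}_k$ be as in Theorem 7.1. Then

$$\mathbb{P}\big\{\hat{\lambda}_k \ge (1+\varepsilon)\lambda_k\big\} \le (p-k+1) \cdot \exp\Big(\frac{-c n \varepsilon^2}{\sum_{i=k}^p \lambda_i/\lambda_k}\Big) \quad \text{for } \varepsilon \le 4n,$$

and

$$\mathbb{P}\big\{\hat{\lambda}_k \le (1-\varepsilon)\lambda_k\big\} \le k \cdot \exp\Big(\frac{-c n \varepsilon^2}{\frac{\lambda_1}{\lambda_k} \sum_{i=1}^k \lambda_i/\lambda_k}\Big) \quad \text{for } \varepsilon \in (0, 1],$$

where the constant $c$ is at least 1/32.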



The first bound in Corollary 7.2 tells us how many samples are needed to ensure that $\hat{\lambda}_k$ does
not overestimate $\lambda_k$. Likewise, the second bound tells us how many samples ensure that $\hat{\lambda}_k$ does
not underestimate $\lambda_k$.

Corollary 7.2 suggests that the relationship of $\hat{\lambda}_k$ to $\lambda_k$ is determined by the spectrum of $C$ in
the following manner. When the eigenvalues below $\lambda_k$ are small compared with $\lambda_k$, the quantity

$$\sum\nolimits_{i=k}^p \lambda_i/\lambda_k$$

is small, and so $\hat{\lambda}_k$ is not likely to overestimate $\lambda_k$. Similarly, when the eigenvalues above $\lambda_k$ are
comparable with $\lambda_k$, the quantity

$$\frac{\lambda_1}{\lambda_k} \sum\nolimits_{i=1}^k \lambda_i/\lambda_k$$

is small, and so $\hat{\lambda}_k$ is not likely to underestimate $\lambda_k$.
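To make the role of these two spectral quantities concrete, the sketch below evaluates them for a geometrically decaying spectrum and for a flat one. The function names and both example spectra are illustrative assumptions, not taken from the text; eigenvalues are passed in decreasing order and $k$ is 1-indexed as in Corollary 7.2.

```python
def overestimation_factor(spectrum, k):
    """sum_{i >= k} lambda_i / lambda_k, which controls the chance of overestimating lambda_k."""
    lam_k = spectrum[k - 1]
    return sum(spectrum[k - 1:]) / lam_k

def underestimation_factor(spectrum, k):
    """(lambda_1 / lambda_k) * sum_{i <= k} lambda_i / lambda_k, which controls underestimation."""
    lam_k = spectrum[k - 1]
    return (spectrum[0] / lam_k) * sum(spectrum[:k]) / lam_k

# Hypothetical spectra in decreasing order (p = 10), chosen only for illustration.
decaying = [2.0 ** (-i) for i in range(10)]   # 1, 1/2, 1/4, ...
flat = [1.0] * 10

k = 3
print(overestimation_factor(decaying, k), underestimation_factor(decaying, k))
print(overestimation_factor(flat, k), underestimation_factor(flat, k))
```

With fast decay the overestimation factor stays close to 2 regardless of $p$, while the underestimation factor grows with the ratio $\lambda_1/\lambda_k$; the flat spectrum shows the opposite behavior.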

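We now have everything needed to establish Theorem 1.1.

Proof of Theorem 1.1 from Corollary 7.2. From Corollary 7.2, we see that

$$\mathbb{P}\big\{\hat{\lambda}_k \le (1-\varepsilon)\lambda_k\big\} \le p^{-\beta} \quad \text{when } n \ge 32 \varepsilon^{-2} \Big(\frac{\lambda_1}{\lambda_k} \sum\nolimits_{i \le k} \frac{\lambda_i}{\lambda_k}\Big) (\log k + \beta \log p).$$

Recall that $\kappa_\ell = \lambda_1(C)/\lambda_\ell(C)$. Clearly, taking $n = \Omega(\varepsilon^{-2} \kappa_\ell^2 \ell \log p)$ samples ensures that, with high
probability, each of the top $\ell$ eigenvalues of the sample covariance matrix satisfies $\hat{\lambda}_k \ge (1-\varepsilon)\lambda_k$.
Likewise,

$$\mathbb{P}\big\{\hat{\lambda}_k \ge (1+\varepsilon)\lambda_k\big\} \le p^{-\beta} \quad \text{when } n \ge 32 \varepsilon^{-2} \Big(\sum\nolimits_{i \ge k} \frac{\lambda_i}{\lambda_k}\Big) (\log(p-k+1) + \beta \log p).$$

Under the decay condition stated in Theorem 1.1, we see that taking $n = \Omega(\varepsilon^{-2}(\ell + \kappa_\ell) \log p)$
samples ensures that, with high probability, each of the top $\ell$ eigenvalues of the sample covariance
matrix satisfies $\hat{\lambda}_k \le (1+\varepsilon)\lambda_k$.

Combining these two results, we conclude that $n = \Omega(\varepsilon^{-2} \kappa_\ell^2 \ell \log p)$ ensures that the top $\ell$
eigenvalues of $C$ are estimated to within relative precision $1 \pm \varepsilon$. □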
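The sample sizes appearing in this proof can be computed directly from a given spectrum. The sketch below is a worked illustration rather than a prescription: the function names, the geometric spectrum, and the choices of $\varepsilon$ and $\beta$ are assumptions. It uses the constant $c = 1/32$ from Corollary 7.2 and returns the smallest integer $n$ for which each bound is at most $p^{-\beta}$.

```python
import math

def under_factor(spectrum, k):
    """(lambda_1 / lambda_k) * sum_{i <= k} lambda_i / lambda_k (k is 1-indexed)."""
    lam_k = spectrum[k - 1]
    return (spectrum[0] / lam_k) * sum(spectrum[:k]) / lam_k

def over_factor(spectrum, k):
    """sum_{i >= k} lambda_i / lambda_k (k is 1-indexed)."""
    return sum(spectrum[k - 1:]) / spectrum[k - 1]

def samples_against_underestimate(spectrum, k, eps, beta):
    """Smallest n with k * exp(-n eps^2 / (32 * factor)) <= p^(-beta)."""
    p = len(spectrum)
    need = 32.0 * under_factor(spectrum, k) * (math.log(k) + beta * math.log(p)) / eps ** 2
    return math.ceil(need)

def samples_against_overestimate(spectrum, k, eps, beta):
    """Smallest n with (p - k + 1) * exp(-n eps^2 / (32 * factor)) <= p^(-beta)."""
    p = len(spectrum)
    need = 32.0 * over_factor(spectrum, k) * (math.log(p - k + 1) + beta * math.log(p)) / eps ** 2
    return math.ceil(need)

def underestimate_bound(spectrum, k, eps, n):
    """Second bound of Corollary 7.2 with c = 1/32."""
    return k * math.exp(-n * eps ** 2 / (32.0 * under_factor(spectrum, k)))

def overestimate_bound(spectrum, k, eps, n):
    """First bound of Corollary 7.2 with c = 1/32."""
    p = len(spectrum)
    return (p - k + 1) * math.exp(-n * eps ** 2 / (32.0 * over_factor(spectrum, k)))

# Hypothetical setup: geometric spectrum with p = 20, chosen only for illustration.
spectrum = [2.0 ** (-i) for i in range(20)]
k, eps, beta = 3, 0.1, 1.0
n_under = samples_against_underestimate(spectrum, k, eps, beta)
n_over = samples_against_overestimate(spectrum, k, eps, beta)
print(n_under, n_over)
```

For a decaying spectrum the requirement against underestimation dominates, matching the observation that the combined sample complexity is governed by the $\kappa_\ell^2 \ell$ term.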



Remark 7.1. The results in Theorem 7.1 and Corollary 7.2 also apply when $C$ is rank-deficient:
simply replace each occurrence of the dimension $p$ in the bounds with rank($C$).



7.1. Proof of Theorem 7.1. We now prove Theorem 7.1. This result requires supporting lemmas;
we defer their proofs until after a discussion of extensions to Theorem 7.1.

We study the error $|\lambda_k(\hat{C}_n) - \lambda_k(C)|$. To apply the methods developed in this paper, we pass
to a question about the eigenvalues of a difference of two matrices. The first lemma accomplishes
this goal by compressing both the population covariance matrix and the sample covariance matrix
to a fixed invariant subspace of the population covariance matrix.

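Lemma 7.3. Let $X$ be a random self-adjoint matrix with dimension $p$, and let $A$ be a fixed self-adjoint
matrix with dimension $p$. Choose $W_+ \in \mathbb{V}^p_{p-k+1}$ and $W_- \in \mathbb{V}^p_k$ for which

$$\lambda_k(A) = \lambda_{\max}(W_+^* A W_+) = \lambda_{\min}(W_-^* A W_-).$$

Then, for all $t > 0$,

$$\mathbb{P}\{\lambda_k(X) \ge \lambda_k(A) + t\} \le \mathbb{P}\{\lambda_{\max}(W_+^* X W_+) \ge \lambda_k(A) + t\} \qquad (7.2)$$

and

$$\mathbb{P}\{\lambda_k(X) \le \lambda_k(A) - t\} \le \mathbb{P}\{\lambda_{\max}(W_-^* (-X) W_-) \ge -\lambda_k(A) + t\}. \qquad (7.3)$$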
We apply this result with $A = C$ and $X = \hat{C}_n$. Because $\hat{C}_n$ is unbounded, we apply Theorem
5.3 to handle the estimates in (7.2) and (7.3). To use this theorem, we need the following moment
growth estimate for rank-one Wishart matrices.

Lemma 7.4. Let $\xi \sim \mathcal{N}(0, G)$. Then for any integer $m \ge 2$,

$$\mathbb{E}(\xi \xi^*)^m \preccurlyeq 2^m m! \, (\operatorname{tr} G)^{m-1} \cdot G.$$

With these preliminaries addressed, we prove Theorem 7.1.
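Before the proofs, it may help to see how the constants in Lemma 7.4 line up with the growth condition of Theorem 5.3. The sketch below is a sanity check under assumptions, not part of the argument: the function names are hypothetical, and it only examines scalar coefficients. It verifies that $2^m m!\,(\operatorname{tr} G)^{m-1}$ equals $\frac{m!}{2} B^{m-2} \cdot 8 \operatorname{tr} G$ when $B = 2 \operatorname{tr} G$, and that in the one-dimensional case $G = s$ the exact moment $\mathbb{E}\,\xi^{2m} = (2m-1)!!\, s^m$ is below the bound of the lemma.

```python
import math

def gaussian_even_moment(m):
    """E g^(2m) for a standard normal g, via 2^m Gamma(m + 1/2) / sqrt(pi)."""
    return 2.0 ** m * math.gamma(m + 0.5) / math.sqrt(math.pi)

def double_factorial_odd(m):
    """(2m - 1)!! = 1 * 3 * ... * (2m - 1)."""
    return math.prod(range(1, 2 * m, 2))

def lemma_coefficient(m, trace):
    """Scalar multiplying G in Lemma 7.4: 2^m m! (tr G)^(m - 1)."""
    return 2.0 ** m * math.factorial(m) * trace ** (m - 1)

def growth_condition_coefficient(m, trace):
    """Scalar multiplying G in (m!/2) B^(m-2) Sigma^2 with B = 2 tr G and Sigma^2 = 8 tr G * G."""
    B = 2.0 * trace
    return math.factorial(m) / 2.0 * B ** (m - 2) * 8.0 * trace

trace = 1.7   # hypothetical value of tr G, chosen only for illustration
checks = [(m, lemma_coefficient(m, trace), growth_condition_coefficient(m, trace))
          for m in range(2, 9)]

# One-dimensional case G = s: E xi^(2m) = (2m - 1)!! s^m versus 2^m m! s^(m - 1) * s.
s = 0.8
scalar_ok = all(double_factorial_odd(m) * s ** m <= lemma_coefficient(m, s) * s
                for m in range(2, 9))
print(checks[0], scalar_ok)
```

The scalar case shows the lemma is not tight: the exact moment carries the factor $(2m-1)!!$, which is smaller than $2^m m!$.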



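Proof of upper estimate. First we consider the probability that $\hat{\lambda}_k$ overestimates $\lambda_k$. Let
$W_+ \in \mathbb{V}^p_{p-k+1}$ satisfy

$$\lambda_k(C) = \lambda_{\max}(W_+^* C W_+).$$

Then Lemma 7.3 implies

$$\begin{aligned}
\mathbb{P}\{\lambda_k(\hat{C}_n) \ge \lambda_k(C) + t\}
&\le \mathbb{P}\{\lambda_{\max}(W_+^* \hat{C}_n W_+) \ge \lambda_k(C) + t\} \\
&= \mathbb{P}\Big\{\lambda_{\max}\Big(\sum\nolimits_j W_+^* (\eta_j \eta_j^*) W_+\Big) \ge n\lambda_k(C) + nt\Big\}. \qquad (7.4)
\end{aligned}$$

The factor $n$ comes from the normalization of the sample covariance matrix.

The covariance matrix of $\eta_j$ is $C$, so that of $W_+^* \eta_j$ is $W_+^* C W_+$. Apply Lemma 7.4 to verify that
$W_+^* \eta_j \eta_j^* W_+$ satisfies the subexponential moment growth bound required by Theorem 5.3 with

$$B = 2 \operatorname{tr}(W_+^* C W_+) \quad \text{and} \quad \Sigma_j^2 = 8 \operatorname{tr}(W_+^* C W_+) \cdot W_+^* C W_+.$$

In fact, $W_+^* C W_+$ is the compression of $C$ to the invariant subspace corresponding with its bottom
$p - k + 1$ eigenvalues, so

$$B = 2 \sum\nolimits_{i=k}^p \lambda_i(C) \quad \text{and} \quad \lambda_{\max}(\Sigma_j^2) = 8 \lambda_k(C) \sum\nolimits_{i=k}^p \lambda_i(C).$$

We are concerned with the maximum eigenvalue of the sum in (7.4), so we take $V_+ = I$ in the
statement of Theorem 5.3 to find that

$$\sigma_1^2 = \lambda_{\max}\Big(\sum\nolimits_j \Sigma_j^2\Big) = n \lambda_{\max}(\Sigma_1^2) = 8n \lambda_k(C) \sum\nolimits_{i=k}^p \lambda_i(C) \quad \text{and}$$

$$\mu_1 = \lambda_{\max}\Big(\sum\nolimits_j W_+^* \mathbb{E}(\eta_j \eta_j^*) W_+\Big) = n \lambda_{\max}(W_+^* C W_+) = n \lambda_k(C).$$

It follows from the subgaussian branch of the split Bernstein inequality of Theorem 5.3 that

$$\mathbb{P}\Big\{\lambda_{\max}\Big(\sum\nolimits_j W_+^* (\eta_j \eta_j^*) W_+\Big) \ge n\lambda_k(C) + nt\Big\} \le (p-k+1) \cdot \exp\Big(\frac{-n t^2}{32 \lambda_k(C) \sum_{i=k}^p \lambda_i(C)}\Big)$$

when $t \le 4n\lambda_k(C)$. This provides the desired bound on the probability that $\lambda_k(\hat{C}_n)$ overestimates
$\lambda_k(C)$. □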

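Proof of lower estimate. Now we consider the probability that $\hat{\lambda}_k$ underestimates $\lambda_k$. The proof
proceeds similarly to the proof of the upper estimate. Let $W_- \in \mathbb{V}^p_k$ satisfy

$$\lambda_k(C) = \lambda_{\min}(W_-^* C W_-).$$

Then Lemma 7.3 implies

$$\begin{aligned}
\mathbb{P}\{\lambda_k(\hat{C}_n) \le \lambda_k(C) - t\}
&\le \mathbb{P}\{\lambda_{\max}(W_-^* (-\hat{C}_n) W_-) \ge -\lambda_k(C) + t\} \\
&= \mathbb{P}\Big\{\lambda_{\max}\Big(\sum\nolimits_j W_-^* (-\eta_j \eta_j^*) W_-\Big) \ge -n\lambda_k(C) + nt\Big\}. \qquad (7.5)
\end{aligned}$$

The factor $n$ comes from the normalization of the sample covariance matrix.

The covariance matrix of $\eta_j$ is $C$, so that of $W_-^* \eta_j$ is $W_-^* C W_-$. Apply Lemma 7.4 to verify
that for any integer $m \ge 2$,

$$\mathbb{E}(W_-^* (-\eta_j \eta_j^*) W_-)^m \preccurlyeq \mathbb{E}(W_-^* \eta_j \eta_j^* W_-)^m \preccurlyeq 2^m m! \operatorname{tr}(W_-^* C W_-)^{m-1} \cdot W_-^* C W_-.$$

Thus, $W_-^* (-\eta_j \eta_j^*) W_-$ satisfies the subexponential moment growth bound required by Theorem
5.3 with

$$B = 2 \operatorname{tr}(W_-^* C W_-) \quad \text{and} \quad \Sigma_j^2 = 8 \operatorname{tr}(W_-^* C W_-) \cdot W_-^* C W_-.$$

In fact, $W_-^* C W_-$ is the compression of $C$ to the invariant subspace corresponding with its top $k$
eigenvalues, so

$$B = 2 \sum\nolimits_{i=1}^k \lambda_i(C) \quad \text{and} \quad \lambda_{\max}(\Sigma_j^2) = 8 \lambda_1(C) \sum\nolimits_{i=1}^k \lambda_i(C).$$

We are concerned with the maximum eigenvalue of the sum in (7.5), so we take $V_+ = I$ in the
statement of Theorem 5.3 to find that

$$\sigma_1^2 = \lambda_{\max}\Big(\sum\nolimits_j \Sigma_j^2\Big) = n \lambda_{\max}(\Sigma_1^2) = 8n \lambda_1(C) \sum\nolimits_{i=1}^k \lambda_i(C) \quad \text{and}$$

$$\mu_1 = \lambda_{\max}\Big(\sum\nolimits_j W_-^* \mathbb{E}(-\eta_j \eta_j^*) W_-\Big) = n \lambda_{\max}(W_-^* (-C) W_-) = -n \lambda_k(C).$$

It follows from the subgaussian branch of the split Bernstein inequality of Theorem 5.3 that

$$\mathbb{P}\Big\{\lambda_{\max}\Big(\sum\nolimits_j W_-^* (-\eta_j \eta_j^*) W_-\Big) \ge -n\lambda_k(C) + nt\Big\} \le k \cdot \exp\Big(\frac{-n t^2}{32 \lambda_1(C) \sum_{i=1}^k \lambda_i(C)}\Big)$$

when $t \le 4n\lambda_1(C)$. This provides the desired bound on the probability that $\lambda_k(\hat{C}_n)$ underestimates
$\lambda_k(C)$. □

7.2. Extensions of Theorem 7.1. Results analogous to Theorem 7.1 can be established for other
distributions. If the distribution is bounded, the possibility that $\hat{\lambda}_k$ deviates above or below $\lambda_k$
can be controlled using the Bernstein inequality of Theorem 5.1. If the distribution is unbounded
but has matrix moments that satisfy a sufficiently nice growth condition, the probability that $\hat{\lambda}_k$
deviates below $\lambda_k$ as well as the probability that it deviates above $\lambda_k$ can be bounded using a
Bernstein inequality analogous to that in Theorem 5.3.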



Theorem 7.1 controls the error in the $k$th sample eigenvalue in terms of all the eigenvalues of
the covariance matrix, so it is most useful when the eigenvalues of the covariance matrix satisfy
decay conditions such as those given in the statement of Theorem 1.1. If such conditions are not
satisfied, the results of [ALPTJ11] on the convergence of empirical covariance matrices of isotropic
log-concave random vectors lead to tighter bounds on the probabilities that $\hat{\lambda}_k$ overestimates or
underestimates $\lambda_k$.

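To see the relevance of the results in [ALPTJ11], first observe the following consequence of the
subadditivity of the maximum eigenvalue mapping:

$$\begin{aligned}
\lambda_{\max}(W_+^* (X - A) W_+) &\ge \lambda_{\max}(W_+^* X W_+) - \lambda_{\max}(W_+^* A W_+) \\
&= \lambda_{\max}(W_+^* X W_+) - \lambda_k(A).
\end{aligned}$$

In conjunction with (7.2), this gives us the following control on the probability that $\lambda_k(X)$
overestimates $\lambda_k(A)$:

$$\mathbb{P}\{\lambda_k(X) \ge \lambda_k(A) + t\} \le \mathbb{P}\{\lambda_{\max}(W_+^* (X - A) W_+) \ge t\}.$$

In our application, $X$ is the empirical covariance matrix and $A$ is the actual covariance matrix.
The spectral norm dominates the maximum eigenvalue, so

$$\begin{aligned}
\mathbb{P}\{\lambda_k(\hat{C}_n) \ge \lambda_k(C) + t\} &\le \mathbb{P}\{\lambda_{\max}(W_+^* (\hat{C}_n - C) W_+) \ge t\} \\
&\le \mathbb{P}\{\|W_+^* (\hat{C}_n - C) W_+\| \ge t\} = \mathbb{P}\{\|W_+^* \hat{C}_n W_+ - S^2\| \ge t\},
\end{aligned}$$

where $S$ is the square root of $W_+^* C W_+$. Now factor out $S^2$ and identify $\lambda_k(C) = \|S^2\|$ to obtain

$$\begin{aligned}
\mathbb{P}\{\lambda_k(\hat{C}_n) \ge \lambda_k(C) + t\} &\le \mathbb{P}\{\|S^{-1} W_+^* \hat{C}_n W_+ S^{-1} - I\| \, \|S^2\| \ge t\} \\
&= \mathbb{P}\{\|S^{-1} W_+^* \hat{C}_n W_+ S^{-1} - I\| \ge t/\lambda_k(C)\}.
\end{aligned}$$

Note that if $\eta$ is drawn from a $\mathcal{N}(0, C)$ distribution, then the covariance matrix of the transformed
sample $S^{-1} W_+^* \eta$ is the identity:

$$\mathbb{E}(S^{-1} W_+^* \eta \eta^* W_+ S^{-1}) = S^{-1} W_+^* C W_+ S^{-1} = I.$$

Thus $S^{-1} W_+^* \hat{C}_n W_+ S^{-1}$ is the empirical covariance matrix of a standard Gaussian vector in
$\mathbb{R}^{p-k+1}$. By Theorem 1 of [ALPTJ11], it follows that $\hat{\lambda}_k$ is unlikely to overestimate $\lambda_k$ in relative
error when the number $n$ of samples is $\Omega(p - k + 1)$. A similar argument shows that $\hat{\lambda}_k$ is
unlikely to underestimate $\lambda_k$ in relative error when $n = \Omega(\kappa_k^2 k)$.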

Similarly, for more general distributions, the bounds on the probability of $\hat{\lambda}_k$ overestimating or
underestimating $\lambda_k$ can be tightened beyond those suggested in Theorem 7.1 by using the results in
[ALPTJ11] or [Ver11]. Note, however, that one cannot use knowledge of spectral decay to sharpen
the results obtained from [ALPTJ11] and [Ver11] into estimates like those given in Theorem 1.1.

Finally, we note that the techniques developed in the proof of Theorem 7.1 can be used to
investigate the spectrum of the error matrices $\hat{C}_n - C$.

7.3. Proofs of the supporting lemmas. We now establish the lemmas used in the proof of
Theorem 7.1.



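Proof of Lemma 7.3. The probability that $\lambda_k(X)$ overestimates $\lambda_k(A)$ is controlled with the
sequence of inequalities

$$\begin{aligned}
\mathbb{P}\{\lambda_k(X) \ge \lambda_k(A) + t\} &= \mathbb{P}\Big\{\inf_{W \in \mathbb{V}^p_{p-k+1}} \lambda_{\max}(W^* X W) \ge \lambda_k(A) + t\Big\} \\
&\le \mathbb{P}\{\lambda_{\max}(W_+^* X W_+) \ge \lambda_k(A) + t\}.
\end{aligned}$$

We use a related approach to study the probability that $\lambda_k(X)$ underestimates $\lambda_k(A)$. Our choice
of $W_-$ implies that

$$\begin{aligned}
\mathbb{P}\{\lambda_k(X) \le \lambda_k(A) - t\} &= \mathbb{P}\Big\{\max_{W \in \mathbb{V}^p_k} \lambda_{\min}(W^* X W) \le \lambda_k(A) - t\Big\} \\
&\le \mathbb{P}\{\lambda_{\min}(W_-^* X W_-) \le \lambda_k(A) - t\} \\
&= \mathbb{P}\{\lambda_{\max}(W_-^* (-X) W_-) \ge -\lambda_k(A) + t\}.
\end{aligned}$$

This establishes the bounds on the probabilities of $\lambda_k(X)$ deviating above or below $\lambda_k(A)$. □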



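Proof of Lemma 7.4. Factor the covariance matrix of $\xi$ as $G = U \Lambda U^*$ where $U$ is orthogonal and
$\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_p)$ is the matrix of eigenvalues of $G$. Let $\gamma$ be a $\mathcal{N}(0, I_p)$ random variable. Then
$\xi$ and $U \Lambda^{1/2} \gamma$ are identically distributed, so

$$\begin{aligned}
\mathbb{E}(\xi \xi^*)^m = \mathbb{E}\big[(\xi^* \xi)^{m-1} \xi \xi^*\big] &= \mathbb{E}\big[(\gamma^* \Lambda \gamma)^{m-1} U \Lambda^{1/2} \gamma \gamma^* \Lambda^{1/2} U^*\big] \\
&= U \Lambda^{1/2}\, \mathbb{E}\big[(\gamma^* \Lambda \gamma)^{m-1} \gamma \gamma^*\big] \Lambda^{1/2} U^*. \qquad (7.6)
\end{aligned}$$

Consider the $(i, j)$ entry of the bracketed matrix in (7.6):

$$\mathbb{E}\big[(\gamma^* \Lambda \gamma)^{m-1} \gamma_i \gamma_j\big] = \mathbb{E}\Big[\Big(\sum\nolimits_{\ell=1}^p \lambda_\ell \gamma_\ell^2\Big)^{m-1} \gamma_i \gamma_j\Big]. \qquad (7.7)$$

From this expression, and the independence of the Gaussian variables $\{\gamma_i\}$, we see that this matrix
is diagonal: when $i \ne j$, the integrand is an odd function of $\gamma_i$, so its expectation vanishes.

To bound the diagonal entries, use a multinomial expansion to further develop the sum in (7.7)
for the $(i, i)$ entry:

$$\mathbb{E}\big[(\gamma^* \Lambda \gamma)^{m-1} \gamma_i^2\big] = \sum_{\ell_1 + \cdots + \ell_p = m-1} \binom{m-1}{\ell_1, \ldots, \ell_p} \lambda_1^{\ell_1} \cdots \lambda_p^{\ell_p}\, \mathbb{E}\big(\gamma_1^{2\ell_1} \cdots \gamma_p^{2\ell_p} \gamma_i^2\big).$$

Denote the $L_r$ norm of a random variable $X$ by

$$\|X\|_r = (\mathbb{E}|X|^r)^{1/r}.$$

Since $\ell_1, \ldots, \ell_p$ are nonnegative integers summing to $m - 1$, the generalized AM-GM inequality
justifies the first of the following inequalities:

$$\begin{aligned}
\mathbb{E}\big(\gamma_1^{2\ell_1} \cdots \gamma_p^{2\ell_p} \gamma_i^2\big)
&\le \mathbb{E}\Big(\frac{\sum_{j=1}^p \ell_j |\gamma_j| + |\gamma_i|}{m}\Big)^{2m}
= \Big\|\frac{1}{m}\Big(\sum\nolimits_{j=1}^p \ell_j |\gamma_j| + |\gamma_i|\Big)\Big\|_{2m}^{2m} \\
&\le \Big(\frac{1}{m}\Big(\sum\nolimits_{j=1}^p \ell_j \|\gamma_j\|_{2m} + \|\gamma_i\|_{2m}\Big)\Big)^{2m}
= \|g\|_{2m}^{2m} = \mathbb{E}(g^{2m}),
\end{aligned}$$

where $g$ denotes a standard Gaussian random variable.

The second inequality is the triangle inequality for $L_r$ norms. Now we reverse the multinomial
expansion to see that the diagonal terms satisfy the inequality

$$\begin{aligned}
\mathbb{E}\big[(\gamma^* \Lambda \gamma)^{m-1} \gamma_i^2\big]
&\le \sum_{\ell_1 + \cdots + \ell_p = m-1} \binom{m-1}{\ell_1, \ldots, \ell_p} \lambda_1^{\ell_1} \cdots \lambda_p^{\ell_p}\, \mathbb{E}(g^{2m}) \\
&= (\lambda_1 + \cdots + \lambda_p)^{m-1}\, \mathbb{E}(g^{2m}) = \operatorname{tr}(G)^{m-1}\, \mathbb{E}(g^{2m}). \qquad (7.8)
\end{aligned}$$

Estimate $\mathbb{E}(g^{2m})$ using the fact that $\Gamma(x)$ is increasing for $x \ge 3/2$:

$$\mathbb{E}(g^{2m}) = \frac{2^m}{\sqrt{\pi}} \Gamma(m + 1/2) < \frac{2^m}{\sqrt{\pi}} \Gamma(m + 1) = \frac{2^m}{\sqrt{\pi}} m! \quad \text{for } m \ge 1.$$

Combine this result with (7.8) to see that

$$\mathbb{E}\big[(\gamma^* \Lambda \gamma)^{m-1} \gamma \gamma^*\big] \preccurlyeq \frac{2^m}{\sqrt{\pi}} m! \operatorname{tr}(G)^{m-1} \cdot I.$$

Complete the proof by using this estimate in (7.6). □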